1.0 INTRODUCTION


As  part  of  the  Air  Force  Base  and  Installation  Security  System 
(BISS)  program,  Rome  Air  Development  Center  (RADC)  has  sponsored  this 
contract, F30602-79-C-0176, entitled "Data Collection, Analysis and
Test" (Data CAT). The purpose of the project was to specify a data
base, and its method of collection, to be used in testing of present
and future voice, fingerprint and signature authentication devices.
This  report  is  the  final  summary  of  the  results  of  that  effort. 

Entry  control  and  the  associated  concept  of  personal  identity 
authentication have long been of interest to RADC, and are integral
parts of the BISS program. A large portion of the effort is devoted
to the acquisition of automated entry control systems to provide all
levels of security. The diverse requirements of varying applications
and levels of security make for a multiplicity of devices and system
configurations, all of which require testing and evaluation. The test
procedures are expensive and often inconsistent and inadequate. Data
CAT is designed to reduce these problems. Specifically, according
to  the  Statement  of  Work,  "The  objective  of  this  study  is  to  determine 
an  experimental  procedure  for  the  collection  of  data  bases  to  be  used 
in  testing  and  evaluation  of  present  and  future  voice,  fingerprint, 
and signature authentication techniques."

A major cost in entry control device testing is the collection of
adequate data from test subjects. With Data CAT, this need be done
only  once,  in  order  to  generate  the  data  base.  Subsequent  testing  is 
performed  by  reproducing  the  appropriate  attribute  from  the  data. 
Since  the  same  procedure  is  followed  for  every  test,  results  should  be 
consistent  and  comparable.  Furthermore,  proper  design  of  the  data 
base  will  ensure  adequate  testing. 

There  were  four  major  issues  to  be  resolved  by  this  effort. 
First,  how  much  data  is  required  in  the  data  base?  For  any  binary 
decision-making device, there are two types of errors: false
rejection  and  false  acceptance.  These  have  been  given  the  names  Type 
I  and  Type  II  errors,  respectively.  We  wish  to  know  how  much  data  is 
required  to  determine  the  Type  I  and  Type  II  error  rates  to  a  given 
confidence.  Naturally,  we  wish  to  determine  the  minimum  amount  of 
data  required,  since  the  cost  of  collection  and  storage  increases  with 
the  quantity  of  data.  This  issue  speaks  to  the  question  of  the 
adequacy  of  the  testing  and  points  out  one  reason  why  other  procedures 
were  inadequate.  Because  of  a  lack  of  understanding  of  the  statistics 
of  the  problem,  or  to  cut  costs,  inadequate  quantities  of  test  data 
were  collected.  We  have  made  our  determination  of  data  quantity  based 
on  a  thorough  statistical  study  of  the  problem. 

Specifically,  we  have  determined  the  minimum  total  number  of  test 
samples  and  the  minimum  total  number  of  individuals  required  to 
determine a Type I error rate of 1% with 90% and 95% confidence, and a
Type II error rate of 2% and .001% with 90% and 95% confidence. We
have also made an estimate of the number of samples required for
enrollment, and the number of different sessions required to collect
the  data. 

Second,  what  information  should  constitute  the  data  base  for  each 
attribute?  The  answer  is,  of  course,  all  the  information  required  to 
reproduce the attribute. This answer is useful, though, only in
pointing the way to the real resolution of the problem and, indeed, to
the key risk area of this effort: how to reproduce the attribute.
For example, it is not obvious how one should store and
reproduce a fingerprint. Optical projection of the image is not
adequate because at least one known system requires actual physical
contact of the fingerprint ridge on the input sensor [65]. The
presence  of  the  ridge  changes  the  index  of  refraction  at  the  boundary 
and  it  is  this  change  which  is  detected.  It  has  been  suggested  that 
the  input  sensor  could  be  bypassed  and  its  output  to  the  analysis 
stage  could  be  simulated.  This  is  not  acceptable  since  the  sensor  is 
such  an  important  part  of  the  device;  its  performance  must  be 
evaluated  also.  We  propose  a  procedure  which  surmounts  all  these 
obstacles. 

In  the  case  of  the  voice  data  base,  the  difficulty  is  not  in  the 
physical  reproduction  of  the  attribute  -  that  can  be  handled  by  an 
amplifier  and  loudspeaker  -  the  difficulty  is  in  constructing  the 
utterance  to  be  reproduced.  The  data  base  must  have  universal 
applicability  which  for  voices  means  that  the  data  must  be  capable  of 
reproducing  an  arbitrary  utterance.  Voice  verification  devices  employ 
a  large  variety  of  utterances  for  verification  and  it  is  not  possible 
to determine a priori which utterances will be required. This fact
dictates that some form of speech synthesis is necessary to reproduce
the speech data base. Not only must the utterance be synthesized on
some fundamental level, but it also must be recognizably distinct for
each  subject  in  the  data  base.  This  requirement  is  indeed  a  stringent 
one. 


It is clear, then, that the method or procedure used to reproduce
the  attribute  will  determine  the  information  to  be  stored  in  the  data 
base. 


Third, how is the data base to be stored? The resolution of this
issue is dictated by the nature of the information to be stored. For
instance, analog speech data should be stored on analog magnetic tape.
In general, the quantity of data will be fairly large, so that some
form of archival "off-line" storage would seem appropriate.
When the time comes to test a device, the data could be brought
"on-line" in some convenient form. Consider, for example, digital
speech data. The volume of data is so large that it would not be
economical to keep it in core memory or even on-line on disk. Digital
magnetic tape would be most appropriate. For device testing, the data
would be easily transferred from tape to disk, or even read from the
tape directly, if random access is not required.

Finally, how should the data be collected? One would like to
collect  the  data  in  a  way  that  assures  its  accuracy  in  representing 
the  population.  To  do  this,  one  must  first  determine  the  population 
to  be  sampled,  then  where  to  find  the  subjects,  then  finally,  how  to 
ensure  the  cooperation  of  the  subjects  in  obtaining  accurate  data. 

Before  pursuing  the  issues  at  hand  any  further,  a  few  general 
remarks  about  our  approach  to  the  design  of  the  data  collection  system 
are  in  order.  Ideally,  we  would  like  the  collection  hardware  to  be 
small,  portable  and  inexpensive,  as  we  anticipate  collecting  data  from 
locales  across  the  nation.  Processing  and  reproduction  equipment  is 
not  so  constrained,  so  long  as  the  data  can  be  recorded  and  brought  to 
a  central  facility.  Our  system  will  require  a  minimum  of  special 
purpose  hardware,  and  will  be  general  enough  to  facilitate  expansion 
and modification.

2.0 QUANTITY OF DATA


The  first  issue  addressed  was  that  of  determining  the  amount  of 
data  required  for  Type  I  and  Type  II  error  testing.  Recall  that  a 
Type  I  error  is  a  false  rejection  and  that  a  Type  II  error  is  a  false 
acceptance.  Derivations  of  the  results  presented  in  this  Section 
appear in Appendix A. Consider first Type I errors.


2.1  TYPE  I  ERROR  TESTING 

We would like to know the minimum total number of samples
required to determine a Type I error rate of p = .01, or 1%, with 90%
and 95% confidence. First note that confidences are only defined on
intervals about some value. Accordingly, we define an interval of
+/- .005, or +/- 0.5%, about p = 1%, which allows a distinction to be
made between 1% and 2%. We find then that 1200 test samples will
suffice to determine p = 1.0 +/- 0.5% with 90% confidence and 1800 test
samples give us 95% confidence in our result. The arguments leading to
these results are interesting because they apply to any binary decision
with a fixed, constant probability.
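
As a rough cross-check on these magnitudes, the sketch below uses the
textbook normal approximation to the binomial, n = z^2 p(1-p) / d^2,
where d is the half-width of the interval. This formula is an assumption
made here for illustration; it is not the derivation of Appendix A. It
gives counts of roughly 1,070 and 1,520, somewhat below the 1200 and
1800 quoted above, so the report's figures are the more conservative
ones. The function name min_samples is hypothetical.

```python
import math
from statistics import NormalDist


def min_samples(p, half_width, confidence):
    """Normal-approximation sample size for estimating a binomial rate.

    p          -- assumed true error rate (e.g. 0.01 for 1%)
    half_width -- half-width of the interval about p (e.g. 0.005)
    confidence -- two-sided confidence level (e.g. 0.90)
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)


if __name__ == "__main__":
    for conf in (0.90, 0.95):
        print(conf, min_samples(0.01, 0.005, conf))
```

Because the required count scales as 1/p when the interval is a fixed
fraction of p, the same function shows why a rate of .001% calls for
on the order of a million samples.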

To find the minimum total number of test subjects, we first
establish that the performance specification p = 1.0 +/- 0.5% is the
average system performance, not individual average performance. Then,
assuming the existence of an undisclosed, poorly performing subgroup,
we find that at least 400 subjects must be included in the data base
to ensure that this subgroup does not unduly affect the results. This
notion  of  subgroups  of  the  population  is  an  important  one  and  will 
affect  the  design  of  the  data  bases. 

Combining the number of samples and individuals tells us that
each subject must give at least three to five samples for the test
data base for Type I testing.

2.2  TYPE  II  ERROR  TESTING 

Now let us consider Type II errors. As with Type I errors, we wish
to determine the minimum total number of samples and subjects
required. The Type II error rates of interest are p1 = 0.02, or 2%, and
p2 = 1 x 10^-5, or .001%. Using the statistics developed for Type I
errors, we first define the intervals about p1 and p2 to be +/- .01 and
+/- 0.5 x 10^-5, respectively. We find that 800 samples will determine
p1 = .02 +/- .01 with 90% confidence and 1000 samples give us 95%
confidence. The values for p2 = 1 x 10^-5 +/- 0.5 x 10^-5 are 1.2 x 10^6
for 90% confidence and approximately 1.8 x 10^6 for 95%. These are the
minimum total number of tests required to determine that the
performance meets the specifications.




Using again the notion of undisclosed subgroups, we find that 200
account/intruder pairs are required for p = 2% and 399,996 pairs for
p = .001%. The population of enrolled subjects for Type I testing can
be paired for Type II testing. With the restriction that the account
and intruder populations cannot overlap, approximately 21 enrolled
subjects will form a sufficient number of pairs for a Type II error of
2% and 895 for .001% error.
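
The subject counts quoted here are consistent with counting unordered
pairs: N enrolled subjects yield N(N-1)/2 distinct pairs. That counting
rule is inferred from the numbers and is not stated in the text, so the
sketch below should be read as an illustration under that assumption.
It finds the smallest N that supplies a required number of pairs; the
function name min_subjects_for_pairs is hypothetical.

```python
import math


def min_subjects_for_pairs(required_pairs):
    """Smallest N such that N * (N - 1) / 2 >= required_pairs."""
    # Start near sqrt(2 * required_pairs), then adjust up or down by
    # exact integer arithmetic so that rounding cannot affect the result.
    n = max(2, math.isqrt(2 * required_pairs))
    while n * (n - 1) // 2 < required_pairs:
        n += 1
    while n > 2 and (n - 1) * (n - 2) // 2 >= required_pairs:
        n -= 1
    return n


if __name__ == "__main__":
    print(min_subjects_for_pairs(200))      # prints 21
    print(min_subjects_for_pairs(399996))   # prints 895
```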


2.3  NUMBER  OF  ENROLLMENT  SAMPLES 

Verification  devices  require  the  subjects  to  first  enroll  on  the 
system,  so  enrollment  samples  must  be  included  in  the  data  base.  In 
keeping with good practice [1], the enrollment samples should be
separate  from  the  test  samples.  How  many  additional  samples  should  be 
collected  from  each  subject  for  enrollment?  In  general,  the  answer  to 
this  question  depends  on  the  dimensionality  of  the  feature  space  and 
the complexity of the decision boundary, neither of which is known a
priori.  An  analytic  solution  is  therefore  not  possible,  but  it  is 
possible  to  make  a  reasonable  guess  based  on  current  devices.  Twenty 
samples  per  subject  turns  out  to  be  a  good,  conservative  figure  and 
indeed,  it  would  seem  unlikely  that  more  than  twenty  samples  might  be 
required  since  an  entry  control  device  requiring  too  large  a  number  of 
enrollment  samples  would  prove  inconvenient  to  its  users. 




2.4  NUMBER  OF  DATA  COLLECTION  SESSIONS 


Finally, it is well known that there are certain long-term
variations in the attributes under consideration. How many sessions
are required to account for these variations? To answer this, assume
again that an undisclosed, poorly performing subgroup has emerged as a
result of the long-term variations. We can then apply our previous
arguments to show that a minimum of 400 collection sessions are
required to ensure against the effects of this subgroup. However, if
we further assume that the temporal variations are not correlated
between subjects, the results for 400 sessions can be inferred from
the results for 400 subjects in one session. Therefore, two data
collection sessions are required: one to collect enrollment samples,
and one to collect test samples. From a study of the long-term
variations in the attributes under consideration [22,25,90], it would
seem  that  any  period  of  time  longer  than  four  or  five  days  between 
sessions  should  be  adequate. 

In  sum,  we  recommend  that  data  be  collected  from  400  subjects; 
their  selection  and  the  data  collection  procedure  will  be  discussed  in 
the  sections  to  follow.  The  number  of  samples  required  for  testing  is 
summarized in Table 1. The collection should take place during two
sessions.

TABLE 1

Error                           Confidence  No. of      No. of    Samples per
                                            Samples     Subjects  Subject

Type I, p = 1% +/- .5%          90%         1200        400       3
                                95%         1800        400       5

Type II, p = 2% +/- 1%          90%         800         20        2
                                95%         1000        20        3

Type II, p = .001% +/- .0005%   90%         1.2 x 10^6  900       2
                                95%         1.8 x 10^6  900       3

Enrollment                                                        20

3.0 THE DATA BASES


We  would  like  now  to  discuss  each  data  base  separately  and  in 
turn. For each data base, the topics covered will be the
characteristic features of the attribute and their variations, and
the  proposed  system  for  recording,  storing,  and  reproducing  the 
attribute. 

Our  aim  in  studying  the  variations  in  the  characteristics  of  the 
attributes  is  to  be  sure  that  the  data  base  explicitly  contains 
representatives  of  any  known  subgroups  of  the  population  in  proportion 
with  their  natural  frequency  of  occurrence.  This  topic  deserves  more 
discussion:  The  goal  of  a  data  base  is  to  represent  variability  of 
the  known  population  so  that  test  results  will  be  useful  in  estimating 
performance.  A  data  base  used  to  test  a  device  is  of  limited  use  if 
the  results  do  not  correspond  to  the  actual  performance  of  the  device 
in  the  real  world,  and  indeed,  this  is  a  problem  that  plagues  any 
testing program. If accuracy in the test results cannot be
guaranteed, certainly precision can be guaranteed by sound design.
Such  a  data  base  would  be  useful  in  comparative  evaluation  of  systems 
and  devices  and  once  experience  is  gained,  correspondence  can  be  made 
between test results and real world performance.

There is a subgroup of the population which must be included in
all the data bases: persons with certain physical handicaps. For
the voice data base, these are persons with speech impairments; for
the fingerprint and signature data bases, persons with malformed or
missing arms, hands, or digits. The reasoning is clear since one
would  expect  that  many  such  persons  would  have  very  high  Type  I  error 
rates,  although  their  Type  II  error  rates  would  probably  be  low.  In 
the  actual  data  collection,  it  would  most  likely  not  be  necessary  to 
collect  data  from  such  persons,  and  this  is  reflected  in  the  data  base 
specification. 


3.1  SIGNATURE  DATA  BASE 


The  signature  has  become  the  standard  means  of  identity 
authentication  in  modern  society.  It  appears  on  bank  drafts  and  legal 
documents  as  proof  of  the  signer’s  identity.  That  the  signature  is 
subject  to  forgery  is  well  known  and  because  of  this,  it  serves  mainly 
as a deterrent to casual imposters. There are really two aspects
of a signature that provide for identity verification. The first is the
static, two-dimensional image itself; signed checks or contracts fall
in this category. It does not take a great deal of skill to forge
this aspect of a person's signature. The second is the dynamic,
ballistic trajectory of the signature as it is produced. Any
witnessed signings fall into this category and clearly this is much
more difficult to forge.

It  has  not  been  proven  conclusively  that  signatures  are  unique  to 
an  individual.  Since  signatures  are  a  learned  activity,  one  could 
certainly  imagine  that  a  skilled  and  dedicated  forger  could  learn  to 
duplicate  the  exact  hand  movements  of  another's  signature,  right  down 
to  the  pressure  of  the  dot  on  an  "i",  but  such  effort  is  hardly 
practical.  Indeed,  little  is  known  about  the  ballistics  of  signatures 
or their attributes. Much of what follows is based on our own work
and conjecture.

3.1.1  Characteristic  Features  - 

We  will  concentrate  our  discussion  on  the  ballistic  history  of 
the  signature  rather  than  the  image.  The  basic  information  that  one 
might  record  would  be  position  and  pressure  (at  the  tip)  as  a  function 
of  time,  f(t)  and  p(t),  respectively.  Straightforward  differentiation 
of  f(t)  results  in  the  velocity  and  acceleration  of  the  tip,  v(t)  and 
a(t).  One  may  also  calculate  the  curvature,  x(t)  or  arc-length,  s(t), 
or  such  things  as  the  angle  of  the  pen  or  the  movement  of  some  part  of 
the hand during signing. One may also derive any function in terms of
another,  for  example,  velocity  and  acceleration  as  a  function  of 
position,  or  arc-length  as  a  function  of  pressure  and  so  on.  This 
provides  a  wealth  of  data  from  which  to  extract  features. 
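
The derived quantities named above can be sketched with simple finite
differences on equally spaced samples. The code below is a minimal
illustration, not the procedure of any particular device: it assumes
the position f(t) is available as separate x and y sample lists taken
at a fixed interval dt, and all function names are hypothetical.

```python
import math


def derivative(samples, dt):
    """Central-difference derivative of equally spaced samples.

    The two end points use one-sided differences.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    out = [0.0] * n
    out[0] = (samples[1] - samples[0]) / dt
    out[-1] = (samples[-1] - samples[-2]) / dt
    for i in range(1, n - 1):
        out[i] = (samples[i + 1] - samples[i - 1]) / (2.0 * dt)
    return out


def kinematics(x, y, dt):
    """Return (speed, acceleration magnitude) lists for a pen trajectory."""
    vx, vy = derivative(x, dt), derivative(y, dt)
    ax, ay = derivative(vx, dt), derivative(vy, dt)
    speed = [math.hypot(a, b) for a, b in zip(vx, vy)]
    accel = [math.hypot(a, b) for a, b in zip(ax, ay)]
    return speed, accel


def arc_length(x, y):
    """Cumulative arc-length s at each sample, starting from zero."""
    s = [0.0]
    for i in range(1, len(x)):
        s.append(s[-1] + math.hypot(x[i] - x[i - 1], y[i] - y[i - 1]))
    return s
```

Values near the ends of a trace are less accurate than interior values
because of the one-sided differences, which matters if features are
extracted from the first or last few samples of a signature.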



3.1.2  Variations  In  Features  - 

All  the  quantities  mentioned  in  the  previous  section  surely  have 
some  natural  range.  The  position,  f(t),  varies  over  a  range  of  a  few 
centimeters, perhaps up to 10 in the horizontal direction; velocities
are on the order of 10^1 cm/sec; accelerations are on the order of
10^2 cm/sec^2. Maximum velocities probably occur in the middle of long,
slightly  curved  or  straight  arcs,  and  maximum  acceleration  occurs  at 
points  of  reversal  of  direction  between  two  such  arcs.  (See  Figure 
1.) 


It is difficult to see systematic variations in any of these
features that lead to any subgroup of the population. Out of
intuition, one would suspect that handedness and possibly gender may
systematically affect handwriting. The left-handed mechanics of
handwriting are simply different from the right-handed, and this may
be evidenced in the production of a signature, if not within the
completed image. Everyone has certainly remarked at one time or
another that a piece was "written in a woman's hand". These
suspicions are borne out by test results of an actual device [91].

It  would  be  appropriate  then,  to  distribute  the  handwriting  data 
base according to gender and handedness.

3.1.3 Proposed Data CAT System For Signatures -

We must reproduce the signature as a ballistic trajectory. The
variables required to do this are simply the position as a function of
time, X(t) and Y(t), and, as a substitute for the Z coordinate, the
pressure as a function of time, P(t). A spatial resolution of .01 in.
(.25 mm) should be sufficient. A sampling frequency of 100 Hz results
in approximately 1000 data points per signature.

A  standard  graphics  tablet  with  either  a  special  surface  or 
special  pen  for  pressure  sensing  would  serve  excellently  for  recording 
the  signatures.  Data  for  each  subject  would  be  collected  and 
processed in real time and stored on digital tape. On reproduction, a
modified x-y recorder would serve as the output transducer. The
recorder would be modified to include a pressure transducer to
reproduce the pen pressure. The drawback of this system is that it
requires  either  very  special  purpose  hardware,  or  a  minicomputer  for 
supervising  the  digitization  and  recording.  This  makes  for  a  system 
that  is  costly  and  difficult  to  transport. 

Before  any  hardware  is  actually  acquired,  we  recommend  a  more 
thorough  study  of  the  range  of  velocities,  accelerations  and  pressures 
involved in handwriting.

3.2 FINGERPRINT DATA BASE


Fingerprints(*) have had a long history dating back as far as the
third  century  A.D.  Evidence  from  this  period  suggests  that 
fingerprints  were  used  as  seals  and  identifying  marks  on  some 
documents.  It  has  only  been  in  the  last  100  years,  though,  that 
fingerprints  were  used  systematically  as  a  means  of  identifying 
people.  Their  usefulness  as  an  identifying  attribute  stems  from  two 
important  qualities.  First,  fingerprints  are  unique.  No  two 
fingerprints  have  ever  been  found  to  be  exactly  alike  and  it  is 
thought by experts that no two ever will be. Cummins [69] gives an
estimate  of  the  probability  for  two  fingerprints  to  be  identical  as 
less  than  one  chance  in  10  .  Since  fingerprint  patterns  are  partly 
controlled  by  heredity,  the  assertion  that  no  two  are  identical  is  put 
to  the  severest  test  in  the  case  of  identical  twins.  Even  in  such 
twins, the prints are at best only similar, not identical. Secondly,
fingerprints do not change in form with age unless altered surgically
or severely damaged. This has been substantiated by observing the
prints of persons taken over intervals of many years [69,71].


(*)  The  terms  'print'  and  'fingerprint'  are  used  interchangeably  and 
refer  to  any  record  of  the  pattern  of  lines  on  the  finger,  or  to  the 
actual pattern on the finger itself. Where a distinction is
important, one will be made.

3.2.1 Characteristic Features -

Simple  examination  of  the  pattern  of  lines  on  a  finger  will 
reveal  all  the  characteristic  features.  The  pattern  consists  of 
ridges  (rugae)  separated  by  narrow  grooves  (sulci),  which  flow  across 
the  finger.  The  ridges  form  a  global  pattern  that  can  be  classified 
as one of three general types: arches, loops, and whorls (see Figure
2). (The lines drawn on the pictures of the loop and whorl are called
lines of count;
they  are  not  important  for  this  discussion.)  The  variations  are  many 
and  it  is  often  difficult  to  make  the  distinction  between  pattern 
types,  but  such  precision  is  not  necessary  for  our  purposes. 

Closer  examination  (a  magnifying  glass  may  prove  helpful)  reveals 
more  detail.  Along  the  crests  of  the  ridges  are  tiny  impressions  that 
are  actually  the  openings  of  the  sweat  pores  (the  white  dots  in  Figure 
2).  These  are  uniquely  distributed  on  every  fingerprint  and  could  be 
used as identifying features by a verification device, but because of
their small size, they are difficult to detect and hence are not of
practical use. Other local features of the print are obvious. These
are the breaks and divergences in the ridge lines that are known as
minutiae. These features are of four types: forks or bifurcations,
ridge endings, enclosures, and islands (see Figure 3). There are
approximately 40 to 200 occurrences of minutiae in the average rolled
fingerprint.

There are two classes of fingerprint verification/identification
devices. The first is based on the local features, the minutiae.
Basically, the locations of the minutiae are extracted and compared
with the file prints. The second is generally known as optical
correlation and is not quite as successful. Light is passed through
transparencies of the test and file prints as they are translated and
rotated. The transmittance function is a measure of the correlation
between the two (there is an equivalent process in frequency space)
[80]. We must therefore reproduce both the local features, the minutiae,
and  the  global  features,  the  ridge  pattern,  from  our  data  base. 

3.2.2 Variations In Features -

Global  features  vary  continuously  and  a  progression  of  pattern 
type can be distinguished (see Figure 4). Pattern 1 is an ideal whorl
and 39 is an ideal arch. Patterns 24 and 28 are loops. Of
course  the  progression  can  be  viewed  as  going  from  1  to  39  or  from  39 
to  1;  no  progression  in  terms  of  development  is  implied. 

Pattern  types  are  not  distributed  randomly  in  the  population. 
The  distinction  is  a  statistical  one;  pattern  types  occur  with 
varying  frequency  on  each  digit  of  each  hand  and  their  occurrence  is 
correlated with race, gender, handedness, and susceptibility to
disease. In general, loops are the most abundant patterns and occur
with  most  frequency  on  the  little  finger.  Whorls  are  most  common  on 
thumb  and  ring  finger,  and  the  index  finger  has  the  highest  frequency 
of  arches. 

People  with  certain  diseases  (e.g.,  neurofibromatosis,  psoriasis, 
schizophrenia,  and  so  on)  tend  to  have  different  pattern  frequencies 
than others of similar sex and racial stock [69]. The hypothesis is
that  some  of  the  same  genetic  factors  that  govern  fingerprint 
formation  also  influence  one's  susceptibility  to  disease.  Table  2 
gives  an  example  of  the  magnitude  of  the  differences  in  pattern 
frequencies  for  German  and  Danish  schizophrenics. 

The  difference  in  pattern  type  frequency  between  the  control 
groups of Germans and Danes is typical of what Cummins [69] calls
racial variations. He defines his use of the word 'race' thus:

"The  sense  of  'race'  in  these  examples  applies  to  a 
group,  whether  comprehensive  or  limited,  marked  by 
common characteristics traceable to inheritance."

Table 3 is representative of racial variations. We see that in a
large sense, Blacks are not distinguishable from Whites, but Orientals
appear  to  have  a  lower  frequency  of  occurrence  of  arches. 

Also in Table 3, the differences between males and females are
shown. In general, females have more occurrences of arches than
males. In addition, females are known to have narrower ridges than
males; they have 2.7 +/- .09 more ridges per centimeter than males
(20.7 for males vs. 23.4 for females).


Frequencies of Whorls and Arches in Three Independent Series
of Schizophrenics, Compared with Controls
From the General Populations

* The genealogy of all these subjects was traced at least as far as through their
grandparents, and East Prussian origin of each generation was established. In the
absence of a control, it should be explained that the higher whorl frequencies, as
compared with Poll's material, are the expected associate of more frequent whorls
in the general population of this territory.

TABLE 2


Pattern-Type Frequencies - Racial Variations

                           MALE                       FEMALE
                  Arches   Loops   Whorls    Arches   Loops   Whorls

Tobabataks         1.6%    55.4%   43.0%      1.9%    58.5%   39.6%
Koreans            2.3     54.4    43.3       2.8     52.6    44.6
Chinese*           2.5     43.5    54.0        -       -       -
Japanese*          2.7     52.8    44.5        -       -       -
Jews               4.6     53.3    42.1       3.9     52.7    43.4
Danes              5.4     64.8    29.8       7.5     66.3    26.2
Negroes            5.5     65.6    28.9       8.5     63.6    27.9
Germans            6.7     67.1    26.2       8.1     64.9    27.0
Angola Negroes     6.7     67.5    25.8       5.1     64.9    30.0
Dutch              7.7     66.1    26.2       9.6     67.3    23.1
Efe Pygmies       15.9     64.4    19.7      17.0     63.2    19.8

TABLE 3

* Data for females not available

The  fineness  of  the  female  ridge  structure  can  have  a  significant 
effect on verification device performance [56]. The closeness of the
ridges would seem to imply a higher density of minutiae on the finger.
On  this  basis  then,  the  gender  of  the  subject  is  identified  as  a 
systematic  variable. 

Handedness  (right  or  left  handed)  is  related  to  sex  variations  in 
that it tends to cancel them. That is to say that left-handed females
tend to have the same occurrence of arches as males. For more details
concerning variations in fingerprint patterns, see Cummins [69] and
Holt [71].

We  have  yet  to  specify  that  pattern  type  is  a  systematic 
variable.  Certainly,  pattern  type  frequency  does  vary  with  the  race, 
gender,  and  handedness  of  the  subject,  but  is  the  variation 
significant  to  the  identification  problem?  We  believe  not.  In  the 
case of minutiae-based authentication devices, there is no evidence
that the occurrence of minutiae is correlated with pattern type. In
optical correlation, there is no reason to believe that any pattern
type is easier to correlate than the others. A second and very
practical consideration is that a vast number of subjects would be
required if statistically meaningful data is to be collected
representing the various combinations and ranges of pattern type
frequency. Note well that we are merely saying that pattern type
frequency need not be sampled for explicitly in the data base. Random
selection of subjects should result in a data base with pattern type
frequencies generally representative of the population.

A variation in fingerprints that is not related to pattern type
is physical damage or aberrations. Damage can range from a small cut
to complete loss of a digit, hand or arm. Small cuts usually heal and
leave no mark visible in the fingerprint. Deeper wounds may leave
scars which result in permanent disruption of the pattern. Aside from
damage related to disease or accidents, there are certain
occupationally related abnormalities. The prints of dishwashers,
scrub-women, and workers in lime, plaster and similar substances
usually show effects of prolonged exposure to alkali and water. The
ridges appear only faintly and are discontinuously printed. These
effects disappear once the occupation is abandoned. Such variations
should be adequately sampled by random selection from the population.

The maximum size of a rolled fingerprint impression is about 5cm
x 5cm. For a pressed print it is about 2.5cm x 5cm. The ridge width
varies from .33mm to .75mm; the minutiae are of comparable size.
With inked prints, the sulci (light lines between the ridges) are
sometimes smeared or partly filled in because of excess ink or
pressure, and so vary in size from about .5mm in width to 0mm (i.e.,
the ridges are indistinguishable). Because of this, very high
resolution (.05mm) is needed to read inked fingerprints.
As mentioned previously, the occupation of the subject has some
effect on ridge height and there must certainly be some natural
variation, but there seems to be no information available concerning
this feature of fingerprints.

In sum, beyond the subgroup of the physically handicapped already
discussed, we find that the fingerprint data base need include males
and females in explicit proportion to their representation in the
population. Within those subgroups, random selection of subjects
should adequately cover all of the variations mentioned, including
occupationally related variations.

3.2.3  Proposed  Data  CAT  System  For  Fingerprints  - 

The most difficult aspect in designing this system is finding a
suitable method of outputting the fingerprint to the verification
device. The two methods mentioned earlier, simulating the sensor
output and optical projection, have been dismissed as inadequate. We
propose to take molds of each of the digits and use these to cast
replicas of the digits. The replicas would be stored and
'reproduction' would consist simply of removing them from their
storage containers.

The  verification  device  would  be  tested  by  manually  placing  the 
replica  on  the  input  sensor.  Data  acquisition  requires  no  special 
transducers,  just  a  spatula  for  mixing  and  a  mixing  pad;  there  is  no 
data  processing,  and  minimal  storage  requirements.  Accuracy  of 
a catalyst and base, which must be mixed as per instructions. The
material is not harmful to skin and is applied directly to the
fingertip of the subject, covering it entirely. When dry
(approximately 6-8 minutes), the mold is removed and sprayed with a
suitable lubricant. Silicone spray lubricant or PAM will suffice.
The same compound is then pressed into the mold and allowed to set.
When set, the compound has a consistency much like skin, it has a fine
sensitivity to detail, and it is non-volatile. The casts are to be
made thin so they can be glued to the fingers of a rubber glove on
each corresponding fingertip. The gloves should be kept in a cool,
dry, dark place to minimize deterioration. To test a device, a
technician places his hand in the glove and follows the enrollment and
test procedure determined by the device undergoing testing. In this
way all individuals are 'reproduced' in the test.

We have produced a small sample of these fingerprints and found
the quality to be quite good. The Calspan fingerprint authentication
device in the laboratory at RADC was able to register the ridge
patterns of the 'reproduced' fingerprint, so we believe this method
will prove quite successful. This data base will be simple and
inexpensive to collect, maintain, and reproduce, and will cause
minimal user discomfort.




The  human  voice  is  marvelous  in  its  capabilities  and 
applications.  It  is  the  primary  mode  of  human  communication.  Subtle 
inflections  and  rhythms  convey  the  gamut  of  human  emotions  and 
intentions. 

To misquote an old adage, how many ways are there to say "I love
you"? These same words can be said in all sincerity, mockingly,
playfully, derisively, hopelessly, lovingly, and so on, and so on;
always  the  same  words,  it  is  the  way  they  are  said  which  conveys  the 
meaning.  The  extent  to  which  the  intended  meaning  and  perceived 
meaning  coincide,  however,  depends  on  the  skill  of  the  speaker  and 
awareness  of  the  listener.  Every  Don  Juan  worth  his  salt  will  have 
command  of  many  modes  of  expression,  and  will  be  able  to  manipulate 
the  articulators  of  speech  (among  other  things  such  as  facial 
expression  and  hands)  to  produce  the  proper  cadence,  emphasis,  and 
timing  to  convey  a  larger  message,  a  more  informative  message,  than 
just  the  words  might  convey.  The  world  about  us  is  full  of  so  many 
examples  of  how  proper  application  of  the  voice  means  more  than  saying 
the  right  words.  A  good  comedian  tells  a  funny  joke;  a  bad  one  tells 
the same joke and it's not funny. The difference is timing, the good
comedian would probably say (at least that's what Johnny Carson says;
the joke goes something like, "People with good timing either become
comedians or parents.").
What is it in speech that allows the same words to say so many
different things? There are three ancillary sources for extra
information. One is visual cues: hand motions, facial expressions
and so on. These play an important role, but we have no interest in
them for this project. Another is context. Words uttered in
differing contexts change not only their connotations, but even their
meaning. Context is of interest here only in how it interacts with
the last source: the 'quality' of speech. By 'quality' we mean the
emphasis, rhythm, tone, and so on, which a speaker controls in
uttering any phrase. These are the factors whose proper manipulation
makes speech sound natural, and the degree to which this can be done
determines the success of one's ability to reproduce speech.

Human  beings  have  the  innate  ability  to  manipulate  these  factors 
and  they  employ  these  abilities  with  greater  or  lesser  skill.  The 
most  talented  or  influential  or  persuasive  speakers  express  the 
ultimate  control  over  not  only  the  quality  of  their  voice,  but  also 
the  text,  visual  cues,  and  context  of  each  phrase.  Machines,  however, 
have no such abilities and so must first be given them, then 'taught'
to use them. As we have said, this is the key risk area in this
effort.

Speech is produced when a pressure, built up in the lungs, is
forced past the vocal cords and through the oral and nasal cavities.
There are two basic modes of speech. The first is when the vocal
cords are held closed. Subglottal air pressure builds until it
forces the vocal cords open and a burst of air passes. The vocal
cords close once more and the cycle repeats. The period of the cycle
is known as the pitch period. The oral and nasal cavities form a
resonant cavity which is excited by the pulse of air coming from the
vocal cords, giving rise to what is known as voiced speech. On the
other hand, if the vocal cords are held open, the speech is called
unvoiced. The excitation of the resonant cavity is furnished by air
rushing past a constriction in the vocal tract, giving rise to a
noise-like excitation. There is no pitch period for this type of
speech.

3.3.1 Characteristic Features -

We must first decide just what we mean by characteristic features
of speech. Do we mean the characteristic features of the speech
signal waveform, such as its statistics, frequency structure, or
energy content, or do we mean the perceived characteristics of the
human voice? There are no compelling arguments that either of those
approaches is more appropriate from a technical point of view. Both
are equivalent and, for the most part, independent. Out of
convenience, we choose to consider the perceived speech
characteristics. These characteristics are simply those which
distinguish dialects of the language in linguistics. This approach is
more convenient because of the relatively larger amount of information
concerning dialects and also because of greater ease in screening
subjects. If subgroups of the population are identified, say by a
particular formant structure, then all subjects would have to first be
screened by analyzing the formant structure of their speech. This adds
enormously to the effort required to collect the data. There is,
however, one feature of the speech waveform which defines easily
distinguishable subgroups of the population. This feature is the
average pitch frequency which, for females, is about twice as high as
for males. This fact is known to cause difficulties for verification
devices. According to Rosenberg, "The difficulties associated with
analysis of female speech are well known. The fundamental problem is
the loss of spectral resolution compared with analysis of male
speech." [16] The loss of spectral resolution is due to the higher
average pitch frequency, leading to more widely spaced harmonics and
less information in a given frequency range. We have then the
immediate result that the sample group should be divided according to
gender.

Dialect is a subjective concept: "Dialects are merely the
convenient summaries of observers who bring together certain
homogeneities of the speech habits of a group and thus secure for
themselves an impression of unity. Other observers might secure
different impressions by assembling different habits of the same
group." [52] Fortunately, precision in determining an absolute dialect
for each subject is not required; we wish only to assure that the
sample population represents the major dialectal subgroups. To do
this we will attempt to identify the factors which affect one's
dialect and from there we can identify the subgroups as those people
for which those factors are important.


The first major factor in determining dialect is the mother
tongue of the subject, the mother tongue being the first language one
acquires. If it is other than American English, then such a person
will  speak  English  with  a  foreign  accent.  This  is  not  a  dialect  of 
American  English  in  a  strict  sense;  it  is,  however,  a  variation  with 
which  we  must  contend.  Among  those  with  foreign  accents  we  include 
persons  whose  mother  tongue  is  British  English  since  British  English 
is  spoken  differently  from  American  English.  We  should  note  here  that 
for  our  purposes,  vocabulary  and  usage  are  not  important  factors  in 
determining  dialect.  We  are  concerned  mainly  with  pronunciation, 
although  it  is  true  that  such  factors  undergo  similar  variations. 
That  is  to  say,  if  a  person  uses  a  word  differently  from  another,  it 
is  more  than  likely  that  he  pronounces  it  differently  also. 

The  next  most  important  element  in  determining  dialect  is  the 
region  of  origin  of  the  speaker.  These  influences  result  from  local, 
regional  variations  in  speech  and  are  established  in  a  child  by 
adolescence.  It  is  not  possible  to  draw  definitive  regional 
boundaries,  and  every  expert  will  propose  slightly  different  ones,  but 
as  we  have  said,  precision  is  not  required.  The  map  in  Figure  5  gives 
an  acceptable  subdivision  of  the  United  States  into  ten  linguistic 
regions. 

In general, socioeconomic status has a profound effect on the
nature and extent of linguistic variation. A typical example is given
in [47] for the occurrence of postvocalic 'r' absence:
Map showing the major regional speech areas:

A: Eastern New England; B: New York City; C: Middle Atlantic;
D: Southern; E: Western Pennsylvania; F: Southern Mountain;
G: Central Midland; H: Northwest; I: Southwest; J: North Central.

Figure 5.




Socioeconomic Class      Mean % 'r' Absence

upper middle             20.8
lower middle             38.8
upper working            61.3
lower working            71.7

The middle classes show more homogeneous speech habits across
regional boundaries; the lower classes exhibit the regional
peculiarities more strongly, though this may be less true in the South
and Southern Mountain regions where upper and middle class speakers
speak a fairly strong regional dialect. According to Wolfram and
Fasold [47], the best indicators of socioeconomic status are
education, occupation, income (both source and amount), house type,
and dwelling area.

There  is  a  dialect  known  as  Vernacular  Black  English  which  is 
common only among lower class urban blacks. This fact brings us to
the  question  of  the  effect  of  the  speaker's  race  or  ethnic  background 
on  his  speech.  It  has  been  proposed  that  there  are  physical  features 
of  vocal  tracts  that  differ  according  to  race;  this  especially  in 
connection  with  Vernacular  Black  English.  However,  this  proposal  is 
not  generally  accepted  by  linguists  and  comparative  anatomical  studies 
do  not  support  it.  Aspects  of  linguistic  behavior  that  are  highly 
correlated with race (more specifically, highly correlated with being
black)  are  due  to  factors  which  cause  the  black  community  to  be  highly 
segregated  socially  from  general  American  influence.  No  other  racial 
or ethnic group, except of course those groups whose mother tongue is
not English, shows any systematic variation. Studies have shown that
the dialect of Puerto Ricans in New York City is affected most by
their peer group contacts, even when there is strong parental
influence in other directions [47]. The persistence of Vernacular
Black English is easy to understand in this light; people growing up
in the urban black community are affected most by their peers and
since urban black neighborhoods are inevitably segregated, those peers
speak Vernacular Black. The dialect is perpetuated by the same social
forces  that  perpetuate  segregation.  The  influence  of  peer  groups  is 
far  reaching.  Quoting  Wolfram  and  Fasold  [47],  "Although  interference 
from  a  foreign  language  may  be  quite  obvious  in  the  speech  of 
first-generation  immigrants,  straightforward  interference  from  another 
language  is  of  little  or  no  significance  for  the  second  and 
third-generation immigrant." This is because English language skills
are acquired through peer group contacts. This is indeed an important
point. One may at first suspect that not only persons whose mother
tongue is not English should be accounted for, but also those who grew
up in households where the predominant language was not English.
Fortunately, we see that this is not the case since
the mechanism of peer group influence tends to homogenize speech
patterns within a given community. For our purposes, speakers of
Vernacular Black English form a recognizable subgroup of the
population.
3.3.2 Variations In Features -

What  sorts  of  variations  are  there  among  the  different  dialects? 
Besides  variations  in  vocabulary  and  usage,  the  major  difference  is  in 
pronunciation,  principally  of  the  vowels  or,  more  generally,  voiced 
sounds.  Referring  to  the  map  in  Figure  5,  the  variations  seen  in 
regions  G,  H,  I,  and  J  are  subtle.  In  fact,  many  linguists  classify 
inhabitants  of  these  regions  as  all  speaking  one  dialect  known  as 
general American English. Speakers from the southern region tend to
slur and elongate vowel sounds. This changes the rhythm of the speech
and gives rise to the Southern drawl. Residents of the New England
area tend to nasalize vowels, which results in the "New England twang".
Persons from central Pennsylvania have a unique dialect known as
Pennsylvania-Dutch. It results from German (Deutsch) influence rather
than Dutch influence, as it first might be thought, and is marked by
confusion of sounds such as 'b' and 'p', 'd' and 't', and others.
There is, of course, much richer regional variation than outlined
here; however, the details are not important.

There  are  variations  in  the  speech  signal  which  are  important. 
The  maximum  frequency  range  of  the  human  voice  is  approximately 
50-6000 Hz, although there is very little information in the higher
frequencies. The dynamic range of the human voice is 30-40 dB [55].
3.3.3 Proposed Data CAT System For Voices -


As  we  have  said  above,  it  is  clear  that  some  form  of  speech 
synthesis  is  required  in  reproducing  the  speech  data  base.  The  method 
chosen  for  the  synthesis  will  determine  the  details  of  the  system,  but 
the  general  form  it  will  take  is  clear.  A  set  of  phrases  will  be 
specified  which  contain  all  the  phonetic  events  required  for 
synthesis. These phrases will be recorded for each subject on analog
tape. Thus, the data collection equipment is inexpensive and
portable.  The  analog  tape  is  then  brought  to  the  computer  facility 
where  it  is  digitized.  Phonemes  are  then  selected  to  form  the  phoneme 
data  base.  The  term  phoneme  is  used  here,  not  in  the  linguistic 
sense,  but  in  the  broad  sense  meaning  the  fundamental  building  blocks 
the  speech  will  be  constructed  from.  From  the  phoneme  data  base,  the 
test  utterances  required  by  the  verification  device  are  constructed, 
then  converted  to  analog  form  and  used  for  the  test.  This  procedure 
is  diagrammed  in  Figure  6. 

The  analog  data  base  is  stored  on  analog  tape,  the  digitized 
speech  is  stored  on  digital  tape,  as  is  the  phoneme  data  base.  The 
digitized  test  utterance  can  be  held  on-line  on  disk  or  off-line  on 
digital  tape. 

What  exactly  is  required  from  our  speech  synthesis?  We  must 
reproduce  a  speaker:  We  must  collect  data  from  a  subject  and  use  it 
to  reconstruct  his  speech.  One  might  call  it  speaker  synthesis.  It 
was clear from the outset that this was not a trivial task. From a
linguistic  point  of  view,  speech  is  not  characterized  well  enough  on 
an  individual  level  such  that  all  of  a  person's  speech  habits,  his 
personal  dialect,  can  be  known  from  some  limited  set  of  data.  From  a 
technical  point  of  view,  state-of-the-art  capabilities  allowed  the 
synthesis  of  natural  sounding  speech.  To  generate  speech  that  was  not 
only  natural,  but  sounded  like  some  individual,  would  take  another 
advance  in  the  state-of-the-art.  There  are  some  aspects  of  this 
problem,  however,  which  allow  for  compromise.  The  verification  device 
under  test  does  not  have  to  verify  that  the  reproduced  voice  be  the 
same  as  the  original  speaker.  It  is  required  only  to  distinguish 
utterances  constructed  from  one  phoneme  set  from  those  constructed 
from  all  other  sets,  with  the  specified  accuracy.  This  eases  the 
requirements  somewhat.  We  do  not  have  to  reproduce  a  set  of  human 
speakers,  we  have  only  to  produce  a  set  of  voices  whose 
characteristics  are  representative  of  the  population.  The 
specifications  then,  for  the  quantity  and  type  of  data  to  be  collected 
are  crucial  since  it  is  here  that  the  data  base  makes  contact  with  the 
real  world.  Additionally,  the  psychology  of  entry  control  argues  that 
the  users  will  grow  accustomed  to  the  system  and  will  learn, 
subconsciously,  to  repeat  the  verification  phrase  in  such  a  way  as  to 
gain  access.  Such  a  system  is  a  classic  example  of  what  psychologists 
call  operant  conditioning,  with  the  reward  being  successful  access. 
Untold  numbers  of  rats  have  learned  to  run  mazes  in  just  this  fashion. 
Experience with existing systems supports this supposition. In fact,
an entry control device may actually use different decision strategies
for users new to the system and those experienced with it [36]. The
new users are judged less strictly while they adjust to and learn the
system. What this all means is that the tremendous variety and
richness of which speech is capable will not be present in a
verification situation. Just as one would expect, a typical user of
the system will not expound grandly his verification phrase one day,
then coo softly on the next. He would, in general, recite it in the
same pat manner as he did originally during enrollment. Since the
context of the situation and phrase never change, no variation in
pronunciation should be expected due to context, and finally, of
course, visual cues or motions are of no consequence. In short, we
now find that it is not necessary to reproduce a specific speaker, nor
the utterance required in a natural sounding voice.

Let  us  now  discuss  our  actual  speech  synthesis  system.  It  is 
based  on  the  source  filter  model  of  speech  production  as  depicted  in 
Figure  7. 

In this model it is assumed that the exciting source, the vocal
cords or a vocal tract constriction, is linearly separable from the
remainder of the vocal tract, which acts like a filter. As we have
said before, the exciting source is either a pulse train or white
noise. The filter can be any appropriate filter, either real or
modeled. Of course, because of the flexibility available, this system
is ideally simulated on a general purpose digital computer. Well
established techniques exist for estimating the filter characteristics
from the speech signal. Assuming the filter can be adequately modeled
by an all-pole filter of moderate size, the linear predictive coding
technique turns out to be very useful [32].


Let the discrete time series output of the system be $s_n$, the
previous outputs $s_{n-k}$ and the inputs $u_{n-l}$; then the system
can be modeled by:

$$ s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G \sum_{l=0}^{q} b_l u_{n-l}, \qquad b_0 = 1 $$

G is the gain factor. Taking the z transform, the transfer function
of the filter is given as:

$$ H(z) = \frac{S(z)}{U(z)} = G \, \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 + \sum_{k=1}^{p} a_k z^{-k}} $$

where S(z) and U(z) are the z transforms of the output and input.
This is known as the pole-zero model. The transfer function can be
estimated to any desired degree of accuracy if all $b_l = 0$ for
$1 \le l \le q$; then:

$$ H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} $$


The problem is to estimate the $a_k$, the linear predictive
coefficients, and to choose p such that the filter is determined to
the desired degree of accuracy. Procedures for doing this are well
established in the literature, and one of the most common is known as
the method of least squares.

Assume the input $u_n$ to the system is not known. The output $s_n$
can then only be approximated by:

$$ \tilde{s}_n = -\sum_{k=1}^{p} a_k s_{n-k} $$

The error $e_n$ between the actual value and the predicted value is
simply the difference:

$$ e_n = s_n - \tilde{s}_n = s_n + \sum_{k=1}^{p} a_k s_{n-k} $$

$e_n$ is also known as the residual, and the $a_k$ can be determined by
minimizing the mean total squared error. The result is a set of p
simultaneous equations in p unknowns and is the same for a
deterministic or random signal. Computationally economical methods
are known for solving these equations and from them we have chosen to
implement the auto-correlation method.


An added benefit from the auto-correlation method is a secondary
set of coefficients known as the partial correlation or reflection
coefficients. The term reflection coefficients arises from
transmission line theory where the reflection coefficients are actually
those of the boundary between two regions of differing impedance with
a plane wave normally incident at that boundary. In the case of
speech, the natural transmission line is an acoustic tube made up of
equal length sections of constant but differing cross-sectional area.
The analog speech is digitized at 12.8 kHz, giving an effective
bandwidth of 6.4 kHz. We use a 20 ms processing frame length which
corresponds to 256 data points per frame. Each frame of digitized
speech is encoded using linear predictive coding. The data is
pre-emphasized, then windowed with a 256 point Hamming window, then the
voicing, pitch period, LPC coefficients, reflection coefficients and
cross-sectional areas are extracted for each processing frame. Each
phoneme is represented by one frame of data. Phoneme selection is
interactive and aided by waveform displays and automatic phoneme
recognition. The operator must make the final determination of which
frame represents the desired phoneme. A library containing all the
required phonemes will be assembled for each subject.
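
The per-frame analysis just described can be sketched as follows. This is a minimal illustration and not the program used in this effort: the pre-emphasis constant, the model order p = 12, and all function names are assumptions chosen for the example, and the sign convention for the area ratios is only one of several in use. The recursion is the standard Levinson-Durbin solution of the auto-correlation equations, written for a denominator of the form 1 + sum a_k z^-k.

```python
import math

SAMPLE_RATE_HZ = 12800   # from the text: 12.8 kHz sampling
FRAME_LEN = 256          # from the text: 20 ms at 12.8 kHz
LPC_ORDER = 12           # assumption: the model order p is not stated
PRE_EMPHASIS = 0.9375    # assumption: a commonly used constant


def pre_emphasize(x, mu=PRE_EMPHASIS):
    """First-order high-pass: y[n] = x[n] - mu * x[n-1]."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]


def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one frame."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Return (a, refl, err): predictor coefficients a_1..a_p for
    A(z) = 1 + sum a_k z^-k, reflection coefficients, residual energy."""
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    refl = []
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        refl.append(k)
        err *= 1.0 - k * k
    return a[1:], refl, err


def area_ratios(refl):
    """Ratios of adjacent tube-section areas, (1 - k) / (1 + k).
    Assumption: some texts use the reciprocal or the opposite sign of k."""
    return [(1.0 - k) / (1.0 + k) for k in refl]


def analyze_frame(samples, p=LPC_ORDER):
    """Pre-emphasize, window, and fit an all-pole model to one frame."""
    x = pre_emphasize(samples)
    xw = [xi * wi for xi, wi in zip(x, hamming(len(x)))]
    r = autocorrelation(xw, p)
    a, refl, err = levinson_durbin(r, p)
    return {"lpc": a, "reflection": refl,
            "area_ratios": area_ratios(refl),
            "gain": math.sqrt(max(err, 0.0))}
```

Voicing and pitch period extraction are left out of the sketch; they require a separate decision procedure that is not described in this section.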

To construct a new utterance, the operator specifies a string of
phonemes along with a relative gain and pitch and a duration. Because
pitch and duration are under operator control, he is responsible for
obtaining the proper prosody. The difficulty with this approach to
speech synthesis lies in handling the transition from one phoneme to
the next. The implicit assumption is made that connected speech can
be modeled as a series of steady state phonemes, reasonably invariant
from occurrence to occurrence, which are connected by smooth
transitions. The speech synthesis program must calculate the
transitions.
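
A minimal sketch of source-filter synthesis for a single steady-state phoneme follows, assuming the all-pole difference equation s_n = -sum a_k s_{n-k} + G u_n. The function names, the unit-impulse pulse train, and the uniform noise source are illustrative assumptions. Transitions between phonemes are deliberately not handled here.

```python
import random


def excitation(n_samples, pitch_period, voiced, rng):
    """Pulse train (one unit impulse per pitch period) for voiced speech,
    zero-mean noise for unvoiced speech."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0
                for n in range(n_samples)]
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(a, gain, u):
    """Run s[n] = -sum_k a[k-1] * s[n-k] + gain * u[n], zero initial state."""
    s = []
    for n, un in enumerate(u):
        acc = gain * un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s.append(acc)
    return s


def synthesize_phoneme(a, gain, pitch_period, duration_s, voiced,
                       sample_rate_hz=12800, seed=0):
    """Hold one phoneme's filter fixed for duration_s seconds.
    pitch_period is in samples and is ignored for unvoiced phonemes."""
    n_samples = int(round(duration_s * sample_rate_hz))
    rng = random.Random(seed)
    u = excitation(n_samples, pitch_period, voiced, rng)
    return all_pole_filter(a, gain, u)
```

An utterance built this way would simply concatenate the outputs for successive phonemes, which is exactly where the transition problem discussed above arises.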

Others  have  tried  this  approach  and  have  met  with  little  success 
because  they  calculated  transitions  by  interpolating  between 
successive  sets  of  LPC  coefficients.  There  is  no  reason  to  believe 
that this scheme has any physical basis and indeed, if one looks at
the time history of the LPC coefficients, one finds they do not change
smoothly.  The  solution  is  to  interpolate  on  a  set  of  physically 
meaningful  coefficients,  the  cross-sectional  areas.  An  extensive 
survey  of  the  cross-sectional  areas  in  natural  speech  has  resulted  in 
interpolation  rules.  Our  experiments  in  this  area  show  that  this 
method  does  work.  We  have  constructed  a  set  of  phrases  taken  from  the 
Texas Instruments Automatic Speaker Verification System [38]. These
phrases  have  good,  natural  sounding  quality  and  can  be  recognized  as 
the  voice  of  the  original  speaker. 
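
The idea of interpolating on cross-sectional areas rather than on LPC coefficients can be sketched as follows. The interpolation rules derived from the survey are not reproduced in this section, so plain linear interpolation is used here purely as a stand-in. The function names, the unit reference area, and the area-ratio sign convention are likewise assumptions. One property holds whatever rule is chosen: any set of positive areas maps back to reflection coefficients of magnitude less than one, so every intermediate filter is stable.

```python
def areas_from_reflection(refl, first_area=1.0):
    """Tube-section areas implied by reflection coefficients, using
    A[i+1] = A[i] * (1 - k) / (1 + k) (an assumed sign convention)."""
    areas = [first_area]
    for k in refl:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


def reflection_from_areas(areas):
    """Inverse of areas_from_reflection: k = (A[i] - A[i+1]) / (A[i] + A[i+1])."""
    return [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
            for i in range(len(areas) - 1)]


def lpc_from_reflection(refl):
    """Step-up recursion: coefficients a_1..a_p of A(z) = 1 + sum a_k z^-k."""
    a = []
    for k in refl:
        a = [a[j] + k * a[len(a) - 1 - j] for j in range(len(a))] + [k]
    return a


def transition_frames(refl_from, refl_to, n_steps):
    """LPC coefficient sets for n_steps frames that move linearly in
    area from one phoneme to the next (endpoints included)."""
    a0 = areas_from_reflection(refl_from)
    a1 = areas_from_reflection(refl_to)
    frames = []
    for step in range(n_steps):
        t = step / (n_steps - 1) if n_steps > 1 else 0.0
        areas = [(1.0 - t) * x + t * y for x, y in zip(a0, a1)]
        frames.append(lpc_from_reflection(reflection_from_areas(areas)))
    return frames
```

Interpolating the LPC coefficients directly offers no such guarantee, since a weighted average of two stable coefficient sets need not itself be stable.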

The  operator  has  the  complete  capability  to  audition  the 
constructed  utterance  and  make  changes  he  deems  appropriate.  The 
operator  should  have  expertise  in  dialectology  so  that  he  will  be 
useful  in  segmentation  and  construction. 


Each subject in this data base has three permanent data sets:
the analog recording of the original passage on audio tape, the
digitized version of this, and the phoneme library, which is also
digital. These are most economically stored on magnetic tape, the
format depending on the particular computer installation on which the
processing was done. In our research, a DEC PDP 11/70 was used. The
digitized data is stored in 512 byte (256 data point) blocks, as
unformatted 2 byte integers. The phoneme data base is stored as
unformatted 4 byte real data. For device testing, the phoneme library
for  each  subject  is  brought  on-line  from  tape,  the  test  utterances  are 
constructed,  then  stored  on  digital  tape  in  the  same  format  as  the 
original  digitized  utterances.  After  the  processing  is  completed, 
these  can  be  played  out  to  the  device  through  the  D/A  interface, 
amplifier  and  loudspeaker. 
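
The 512 byte block layout for the digitized speech can be illustrated with a short sketch. The little-endian byte order reflects the 16-bit word order of the PDP-11 family. Zero-padding of the final block and the function names are assumptions made for the example. The 4 byte real phoneme data is omitted because its floating-point format on tape is not specified here.

```python
import struct

BLOCK_POINTS = 256   # data points per block, from the text
BLOCK_BYTES = 512    # 256 points x 2 bytes, from the text


def pack_blocks(samples):
    """Split 16-bit integer samples into 512 byte blocks.
    Assumption: a short final block is padded with zeros."""
    blocks = []
    for start in range(0, len(samples), BLOCK_POINTS):
        chunk = list(samples[start:start + BLOCK_POINTS])
        chunk += [0] * (BLOCK_POINTS - len(chunk))
        blocks.append(struct.pack("<%dh" % BLOCK_POINTS, *chunk))
    return blocks


def unpack_block(block):
    """Recover the 256 integer samples held in one 512 byte block."""
    return list(struct.unpack("<%dh" % BLOCK_POINTS, block))
```

Because one block holds exactly one 256 point processing frame, a test utterance written back in this layout can be read frame by frame with no additional indexing.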
4.0 DATA BASE COMPOSITION


We  have  discussed  so  far  the  quantity  of  data  required,  its  form 
and methods of storage and reproduction. We will now describe the
actual  data  collection. 

The most economical way to collect the data base will be to use
portable equipment which can be brought to the collection site. We
recommend the data be collected at U.S. military installations since
all subjects required are likely to be found there. Each subject will
be sampled for his fingerprints, signature and voice. Care should be
taken in screening subjects and in ensuring that the data are
accurate. Both civilian and military personnel, officers and enlisted
men, should be included in the population.

We have identified subgroups of the population for each attribute
and the sample should be assembled accordingly. The sample should be
half male and half female. Each of these groups should then be
divided according to mother tongue, then region of origin. Within the
smallest subdivisions, subjects should be drawn at random (see Figure
8). This satisfies the requirement for the voice and fingerprint data
base, but the signature data base requires it be distributed
according to handedness. Further subdivision of the population is
undesirable since small numbers of persons in a subdivision would not
lead to statistically meaningful results. Rather than increase the
sample size, we believe that the random draw will adequately sample
the right and left handed populations. Table 3 gives a breakdown of
the U.S. population into the subgroups we have identified. Of the 400
subjects, 200 will be male and 200 female. The last column in Table 4
gives the number of male and female subjects required for each region.

[Figure 8: sample group]

The entire population should be characterized in terms of the
important variables that we have defined: gender, mother tongue, and
region of origin. Only those raised from birth through adolescence in
one region shall be considered as true members of that subgroup.
Others may speak with a dialect reflecting the influence of two or
more different regions; the same holds for mother tongue. Once the
population is so divided, names can be drawn at random and the named
person can be asked to participate in the study. The voluntary
participation of the subject should give some confidence that he will
be cooperative. So as not to stretch our confidence in human nature
too far, we suggest that a small monetary compensation may buy a
little more cooperation.
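
The divide-then-draw procedure can be sketched as below. This is a hedged illustration only: the stratum names and weights used in any call are hypothetical, and the largest-remainder rounding rule is an assumption, not the rounding used to build Table 4.

```python
import random


def allocate(total, weights):
    """Split `total` subjects across strata in proportion to `weights`.

    Uses largest-remainder rounding (an assumption of this sketch).
    """
    scale = total / sum(weights.values())
    exact = {k: w * scale for k, w in weights.items()}
    quota = {k: int(v) for k, v in exact.items()}
    leftover = total - sum(quota.values())
    # Give the remaining places to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_fraction[:leftover]:
        quota[k] += 1
    return quota


def draw(frame, quota, seed=0):
    """Draw quota[s] names at random, without replacement, from stratum s."""
    rng = random.Random(seed)
    return {s: rng.sample(frame[s], n) for s, n in quota.items()}
```

Here `frame` maps each stratum (for example a gender, mother tongue and region combination) to the list of eligible names; fixing `seed` makes a particular draw repeatable.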

Once the subject has been secured, a short briefing explaining
the purpose of the project should be given to orient the subject and
to give him time to relax. Every subject should be reassured that the
information collected in this study will be used only for the stated
purpose and will not be circulated without his permission. The
fingerprints, being unaffected by the emotional state of the subject,
should be collected first. The signatures (signing being a very
natural act) should be collected next, then finally the voice data
                               TABLE 4

("?" marks a character that is unreadable in the source.)

Region                   American English       %      Out of 160
                                                        Samples

Eastern New England         5,085,18?          3.7          6
New York City               6,631,491          4.2          7
Mid Atlantic               12,676,239          2.?         12
Southern                   37,027,323         22.5         37
Western Pennsylvania        4,572,180          2.9          5
Southern Mountains          8,445,017          5.2          9
Mid Central                25,056,608         15.?         25
Northwest                   5,262,160          2.7          6
Southwest                  15,575,619          5.5         16
North Central              36,257,947         22.9         26
Total                     158,049,795        100.0        160

Mother Tongue            Non-American           %      Out of 40
                         English                        Samples

Spanish                     7,822,523         17.5          0
German                      6,052,054         14.0          7
Italian                     4,144,215          5.5          5
French                      2,592,400          6.0          ?
Polish                      2,437,520          5.6          ?
English                     1,657,025          3.9          5
Yiddish                     1,592,993          2.7          2
Russian                       234,565           ?           ?
Other                       8,149,266         10.7          8
Not Reported                8,764,250         20.1        - (*)
Total                      42,627,205        100.0         40

(*) Subjects credited to 'unreported' were distributed evenly among
all other categories (one additional subject for each).
base. The voice recording should be done in a sound booth or a quiet
room. The subject should be given time to familiarize himself with
the text to be recited, and any ambiguities or questions should be
cleared up prior to recording. The recording should not be rushed and
the subject should be allowed to pause if desired. All precautions
should be taken to ensure recording the subject in as natural a state
as possible. Figure 9 is a list of equipment required for recording
the voice data base.


List of Data CAT Speech Processing Hardware


Item                          Specification/Recommendations


Microphone/Preamp             Frequency response 50-6000 Hz
                              Cardioid condenser/FET preamp,
                              windscreen, associated hardware
                              AKG CK1 Condenser Mic
                              C451E FET Preamp
                              W3 Windscreen

Linear Audio Amplifier        Frequency response 50-6000 Hz
                              Variable gain
                              S/N > 60 dB

Bandpass Audio Filter         Low cut 50 Hz
                              High cut 6000 Hz
                              S/N > 60 dB

Analog Audio Tape             Frequency response 50-6000 Hz
Recorder/Player               S/N > 60 dB
(2 or 4 track)                THD < 0.5%

Audio Loudspeaker             Frequency response 50-6000 Hz
                              High efficiency,
                              4 or 8 Ohms

Computer w/Analog             Digital tape, large disk
Interface                     > 12 bit A/D, D/A
                              > 12000 Hz sampling rate
                              DEC PDP 11/70 w/ LPA11-K
                              RP04, TU16

Graphics Terminal             Waveform display
                              Tektronix 4014

Miscellaneous                 Cables; connectors;
                              magnetic tape


Figure 9


5.0 CONCLUSIONS AND RECOMMENDATIONS


We believe that the Data CAT approach is basically sound. The
relatively high expense of collecting the signature data base, weighed
against its possible uses, leads us to believe that this effort would
not be cost effective. The fingerprint data base is extremely easy
and inexpensive to collect and could prove very useful in testing not
only fingerprint identification devices, but also fingerprint
recognition and classification devices. Since this data base need
only be subdivided by gender, the collection could take place in any
population center and a large number of fingerprints could be included
in the data base at very low cost.

The voice data base has a moderate initial cost due to the
acquisition of required equipment, and the screening of subjects is
more costly, but the potential benefits are very great considering the
growing field of speech identification and recognition. As with the
fingerprint data base, the speaker synthesis system and voice data
base can be useful in testing both speaker identification and speech
recognition devices.

Since this is a new application of new technology, it may be wise
to proceed cautiously in its development. The cost of the actual data
collection will obviously far outweigh the cost of equipment and
software required for the processing. However, the capability to
acquire the data base would not be that costly. The speaker synthesis
unit, consisting of a host computer, analog interface, graphics
capability, software, and associated analog equipment, could be
procured and a small data base collected for minimal cost. This
would allow the user the opportunity to prove the technology under
laboratory conditions and also establish baseline performance and real
world performance, thus avoiding the typical pitfall of precise but
inaccurate testing.

As an added bonus, software and techniques developed under
contract F30602-79-C-0226, known as UNITRANS, also sponsored by RADC
and recently completed by PAR [92], could be easily integrated into the
speaker synthesis unit, providing the user with a virtually unlimited
number of synthetic speakers and virtually unlimited speech synthesis
capability.

As with any good laboratory tool, the uses of such a system are
innumerable. It could be used in testing speaker verification
devices, as per its original intent, speech recognition devices, voice
communication and bandwidth compression systems, computer simulations
of the above, and so on.
APPENDIX A
DATA CAT STATISTICS

A.1 NUMBER OF ATTEMPTS AND SUBJECTS FOR TYPE I ERROR TESTING

The purpose of this section is to answer a question posed by
paragraph 4.1.1.1 of the Data CAT Statement of Work: "Determine the
number of samples per individual and the number of separate sessions
per individual required to determine a Type I error of .01 with a 90%
and 95% confidence level."

We  first  assume  there  is  no  variability  of  the  attribute  or  its 
measurement  process  from  session  to  session.  In  this  simplified  case 
we  will  find  the  number  of  samples  required  in  the  data  base.  The 
question  of  how  many  sessions  are  required  will  be  addressed  in  a 
later  section. 

Let the data base consist of N samples. An identity verification
device is to be tested. The result of each test is an acceptance or a
rejection. Because of the assumption that there is no
session-to-session variability, there is a single constant
probability, p, that a sample will be rejected by the test. After all
N samples are tested, M will have been found to be rejected. The
problem is to estimate p, the Type I error rate. The likelihood
function for p is (Reference 1, p. 196):

$$L(p) = p^{M} (1-p)^{N-M} \tag{A1}$$

and the most likely estimate of p is $p^*$, that value of p which
maximizes log L:

$$\frac{d \log L}{dp} = \frac{M}{p} - \frac{N-M}{1-p} = 0 \tag{A2}$$

$$p^* = \frac{M}{N} \tag{A3}$$

which is, of course, the intuitive estimate for the Type I error as
well.

A confidence interval about $p^*$ is defined by a single parameter
$\Delta p$, which is said to provide a confidence level of C when

$$C = \int_{p^* - \Delta p}^{p^* + \Delta p} L \, dp \Big/ \int_{0}^{1} L \, dp \tag{A4}$$

Equation A4 means that the statement "The Type I error has a value
$p^* \pm \Delta p$" will be true 100C percent of the time.

Because of the shape of L, for certain values of M and N it will
not be possible to find a single $\Delta p$ which provides the desired
confidence and at the same time keeps both $p^* + \Delta p$ and
$p^* - \Delta p$ within the known range of p (from 0 to 1). In this
case, a logical interpretation of Equation A4 is to replace
$p^* + \Delta p$ with 1 or $p^* - \Delta p$ with zero, depending upon
which limit was exceeded.
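
The confidence of Equation A4, including the clamping of the interval to the range 0 to 1, can be evaluated numerically. The sketch below is an illustration under stated assumptions: it substitutes plain trapezoidal integration for any closed-form treatment, and the function names and step count are choices made here, not part of the report.

```python
import math


def log_likelihood(p, m, n):
    """log of L(p) = p**m * (1 - p)**(n - m), safe at the end points."""
    if p <= 0.0:
        return 0.0 if m == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if m == n else -math.inf
    return m * math.log(p) + (n - m) * math.log1p(-p)


def integrate(lo, hi, m, n, ref, steps=20000):
    """Trapezoidal integral of L(p) / exp(ref) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(log_likelihood(lo + i * h, m, n) - ref)
    return total * h


def confidence(m, n, dp):
    """Confidence C that p lies within dp of p* = m / n."""
    p_star = m / n
    lo = max(0.0, p_star - dp)   # clamp the interval as the text describes
    hi = min(1.0, p_star + dp)
    ref = log_likelihood(p_star, m, n)  # rescale to avoid underflow
    return integrate(lo, hi, m, n, ref) / integrate(0.0, 1.0, m, n, ref)
```

For M = 1 rejection in N = 100 samples and an interval of .005 either side of the estimate, this gives a confidence of roughly 0.36, which shows how little 100 samples constrain an error rate near 1%.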

We would now like to plot L for a reasonable value of M and N in
order to gain some insight into its behavior. What is a typical value
of M? This is turning the problem about the other direction.
Previously we have been considering a best estimate for p given M.
Now we want to know a typical M, which, of course, can only be
answered by knowing p. The value of p of interest for this study is
.01. Thus, we now ask for the most likely value of M given p.
Clearly

$$M^* = pN \tag{A5}$$

and, in fact, the probability that any value of M will be observed is
given by the binomial distribution

$$\pi(M) = \frac{N!}{M!\,(N-M)!}\, p^{M} (1-p)^{N-M} \tag{A6}$$
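
The binomial distribution for the number of rejections can be evaluated directly to find the most likely count. This is a minimal sketch; the function names are illustrative only.

```python
import math


def binomial_pmf(m, n, p):
    """Probability of exactly m rejections in n independent trials."""
    return math.comb(n, m) * p**m * (1.0 - p)**(n - m)


def most_likely_m(n, p):
    """Value of m with the highest binomial probability."""
    return max(range(n + 1), key=lambda m: binomial_pmf(m, n, p))
```

With N = 100 and p = .01 the probabilities of 0 and 1 rejections are nearly equal (about .366 and .370), so the most likely count is 1 by a narrow margin.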


Taking N = 100 samples, we find a most likely value of M to be 1. A
plot of L(p) for N = 100 and M = 1 is shown in Figure A1. Note that if a
confidence of .95 were specified, the interval about $p^*$ would be
asymmetric. The lower value of p would be zero while the upper would
The above discussion should make it clear that confidences are
defined on intervals of p. The problem before us is to find a sample
size which will provide testing to a specified confidence for p = .01.
This is insufficient information, as an interval about .01 must also
be set. What is a reasonable value for the interval? That is, what
accuracy on p is desired? We submit that if the value of p required
of the identity verification system is .01, then one would like to
distinguish between the case p = .01 and p = .02. (One is not really
interested in knowing that p = .00921 ± .00001, for example, although
given sufficient samples this level of accuracy could be achieved.)
Thus, a reasonable value of $\Delta p$ for p = .01 is $\Delta p$ = .005.
This will permit the 1% and 2% Type I error cases to be distinguished.

The question which we have set out to answer may now be posed:
"What sample size N is required to permit p in the neighborhood of .01
to be determined to ±.005 with a confidence of 90% or 95%?" It is
clear from Figure A1 that N = 100 samples is insufficient. As N grows
larger, with M = pN = .01N, L approaches a Gaussian shape,

$$L(p) \propto \exp\!\left( - \frac{N^{2} \left(p - \frac{M}{N}\right)^{2}}{2M\left(1 - \frac{M}{N}\right)} \right) \tag{A7}$$

Equation A7 is easily derived by expanding log L in a Taylor series
about $p^*$. From the normal curve of error, 90% of the area is
contained within 1.645 standard deviations and 95% within 1.96. Thus,
we require a value of N such that

$$1.645 \sqrt{M\left(1 - \frac{M}{N}\right)} \le \tfrac{1}{2} (.01)\, N \tag{A8}$$

Inserting M = .01N gives

$$1.645 \sqrt{(.01)\, N\, (.99)} \le .005\, N \tag{A9}$$

$$1071.6 \le N \tag{A10}$$

The 95% confidence value is 1521.3 ≤ N.
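
The Gaussian estimate reduces to a one-line formula, sketched below. The function name is illustrative; z is the number of standard deviations for the chosen confidence (1.645 for 90%, 1.96 for 95%).

```python
def gaussian_sample_size(p, dp, z):
    """Real-valued lower bound on N from z * sqrt(p * (1 - p) * N) <= dp * N."""
    return (z / dp) ** 2 * p * (1.0 - p)
```

Squaring the inequality and dividing by N gives N ≥ (z/Δp)² p(1-p), which evaluates to about 1071.6 and 1521.3 for the two confidence levels at p = .01 and Δp = .005.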


The estimates based on assuming a Gaussian shape can be refined
to exact answers by performing the integration of Equation A4. The
integrals can be written as:

$$C = \frac{B_{p^*+\Delta p}(M+1,\, N-M+1) - B_{p^*-\Delta p}(M+1,\, N-M+1)}{B(M+1,\, N-M+1)} \tag{A11}$$

where $B_x$ and B are the incomplete and complete beta functions [2,
p. 265]. Using an expansion for $B_x$ good for small, non-zero values
of x [2, p. 944], we can compute a confidence table, Table A1, for
p = .01. Table A1 is used by finding a confidence interval of interest
in the top row and a number of samples in the left hand column. The
intersection of row and column gives the confidence value. We have
recommended an interval of ±.005 about p = .01. This is tabulated in
the second column. From Table A1 we see that 1200 samples would
produce a 90% confidence on this interval and 1800 would produce 95%
confidence.

The above discussion has made no mention of the number of
different individuals included in the study. It simply says that a
binary decision making device must be tested 1200 times to establish
that $\Delta p$ = ±.005 when p = .01. Suppose now that the access
system is tested on two people. The requirement that the system
perform at p = .01 can be interpreted in two different ways. The
Type I error
                              TABLE A1

        Confidence Values for Selected Number of Samples
    and Intervals About p = .01 for the Binomial Distribution

("?" marks an entry that is unreadable in the source.)

                                 p Interval

N Samples   [.0075, .0125]  [.005, .015]  [0, .02]  [0, .03]  [0, .04]

   100           18.3           35.4        59.7      80.3      90.7
   200           26.2           50.1        76.8      94.2      98.8
   500           41.5           72.0        93.5      99.7
  1000           56.8           87.0        99.0
  1100            ?             88.7        99.3
  1200           61.1           90.2        99.4
  1500           66.5           93.4        99.8
  1600           68.0           94.2        99.9
  1700           69.5           94.2
  1800           70.9           95.4
could be the average system performance or it could be the performance
requirement on each individual. That is, if Individual 1 is tested
10,000 times and rejected 175 times, and Individual 2 is rejected 25
times out of 10,000, then the system performance is p = .01 in the
average sense even though Individual 1 has shown a p = .0175.

Which of the two interpretations above makes the best sense for
testing an access device? It seems obvious that in a large human
population one will always be able to find a subject whose
measurements are sufficiently variable to produce a p greater than
.01. For an acceptable access control system, however, the number of
such subjects should be vanishingly small. Thus, the interpretation
of a p specification as a system average is the sensible one. It
follows that the number of samples we have computed is the total
samples for all individuals, not the number of samples per individual.

The foregoing argument would make it appear that 1200 samples
could be drawn from 1200 subjects, one sample per subject (in addition
to the samples needed for enrollment). We would now like to show that
there is a more realistic lower bound on the number of samples per
subject.

What will determine the number of subjects in the study? First
of all, we note that fewer subjects in the system permits economy of
data collection and storage because a fixed number of samples per
subject must be collected for enrollment. Say 100 enrollment samples
are collected for each of 1200 subjects. The enrollment data base is
120,000 samples while the total data base of enrollment and test data
bases is only 1200 greater, continuing the example of one test sample
per individual. At the opposite end of the scale, if only one
individual is to be used for the data base with 1200 test samples for
him, then the total data base consists of only 1300 samples. It is
also apparent that the collection of data from one individual would be
easier and less costly than from 1200.
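
The storage comparison is simple arithmetic, sketched here with an illustrative function name. The figure of 100 enrollment samples per subject is the example value used in the paragraph above, not a recommendation.

```python
def total_samples(subjects, enroll_per_subject, test_samples):
    """Enrollment samples for every subject plus the fixed pool of test samples."""
    return subjects * enroll_per_subject + test_samples
```

With 1200 test samples in both cases, 1200 subjects require 121,200 stored samples while a single subject requires 1300.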

Despite the foregoing, it is obvious that the data base must
include more than one subject because the population of subjects will
not be homogeneous with respect to the attribute being measured. In
collecting the signature data base, for example, we know a priori that
there are two fundamental groups of subjects, left- and right-handed
persons. Subjects must be drawn from all groups of a significant size
for which there is reasonable probability of systematic attribute
variation. Let us suppose that 20% of the population is left-handed
and 80% right-handed. Then a possible procedure is to use one subject
from each of the two classes and to collect four times more samples
from the right-hander. Since the average p will be computed as the
weighted sum of the right-handed Type I error, $p_R$, and the left,
$p_L$,

$$p = .2\, p_L + .8\, p_R \tag{A12}$$

error in $p_R$ contributes more to error in p. An alternate and
superior procedure is to use one left-handed subject and four right.
Then an equal number of samples should be collected from each.
That subgroups are expected in the population demonstrates that
previously undiscovered subgroups may be revealed in the testing of a
new access device. This is an additional reason why there must be a
number of different subjects in the data base. For example, suppose
the data base consisted of samples from two randomly selected subjects
and there are two undiscovered subgroups, each comprising 50% of the
population. Further suppose they have Type I errors of 1.9% and .1%.
Two times out of four the individuals in the data base would include a
subject from each subgroup, one time in four it would include two
subjects from the first subgroup, and one time in four, two from the
second. Assume enough samples per subject that the Type I error for
the first subject, $p_1$, and for the second, $p_2$, are known with
high precision. Then the true p for this population, $p_{true}$, is
1.0%. However, because of too few subjects in the data base, a value
of p different from $p_{true}$ can result. This is shown in Table A2,
where the three cases are given at the left of the Table, each with
its probability of occurrence. The value of p which would be computed
is shown in the column labelled 'p', and the square deviation from
$p_{true}$ in the last column. The rms deviation is .64%.
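
The rms figure can be checked by enumerating every possible draw of subjects. The sketch below is illustrative; the function name and argument layout are assumptions, and rates are given in percent to match the text.

```python
import itertools
import math


def rms_deviation(rates, fractions, subjects):
    """RMS error of the sample-average Type I rate over all possible draws.

    rates[g] is the Type I error of subgroup g, fractions[g] its share of
    the population, and `subjects` the number of subjects drawn at random.
    """
    p_true = sum(f * r for f, r in zip(fractions, rates))
    total = 0.0
    for picks in itertools.product(range(len(rates)), repeat=subjects):
        prob = math.prod(fractions[g] for g in picks)
        p_measured = sum(rates[g] for g in picks) / subjects
        total += prob * (p_measured - p_true) ** 2
    return math.sqrt(total)
```

For two equal subgroups at 1.9% and .1% and two subjects, the measured p is 1.9%, 1.0% or .1% with probabilities 1/4, 1/2 and 1/4, so the rms deviation is the square root of .405, about .64%.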

TABLE A2

We now consider the same situation more generally. Instead of
two subgroups with discrete values of p, we permit a continuum of
possible values of p. Now let f(p)dp be the fraction of the
population having Type I error between p and p+dp. We want to
know how many subjects to include in our sample in order to prevent a
widely spread distribution f from affecting the results. Again,
assuming that enough samples are taken from each subject so that the
value of p for the subject may be determined with ignorable error, we
can state that the average value of p given by

$$\bar{p} = \int_{0}^{1} p\, f(p)\, dp \tag{A13}$$

is the value of p for the whole population and is therefore the value
we would want our sample to represent. Thus, we need to have enough
subjects so that $\bar{p}$ is determined with small error. The
accuracy of an estimate of $\bar{p}$ is also determined by the
variance of the distribution f. In fact, from the Central Limit
Theorem we can state that the error in a determination of $\bar{p}$
from K subjects, $\sigma$, is given by

$$\sigma = \frac{\Delta p}{\sqrt{K}} \tag{A14}$$

as K grows large. Here $\Delta p$ is the standard deviation of p due
to the distribution f,

$$\Delta p^{2} = \int_{0}^{1} (p - \bar{p})^{2} f(p)\, dp \tag{A15}$$

Actually, the type of distribution which produces the largest
$\Delta p$ and hence, according to Equation A14, the largest $\sigma$
is a binomial distribution of the sort we considered in the example of
Table A2. We will make this worst case assumption in order to
establish an upper bound on K. Let $f_1$ be the fraction of the
population belonging to Subgroup 1. Let $p_1$ be the Type I error of
this subgroup. Let $f_2$ and $p_2$ be the corresponding quantities for
Subgroup 2. K subjects are selected randomly. The true value of p
for the population is:

$$\bar{p} = f_1 p_1 + f_2 p_2 \tag{A16}$$


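A particular draw of K subjects will consist of $\nu$ members of
Subgroup 1 and $K - \nu$ of Subgroup 2. The probability, $\pi$, of
this event is

$$\pi(\nu) = \binom{K}{\nu} f_1^{\nu} f_2^{K-\nu} \tag{A17}$$

The resulting Type I error which would be measured is

$$p = p_2 + \frac{\nu}{K} (p_1 - p_2) \tag{A18}$$

and its mean square deviation from the true value is

$$\sigma^{2} = \sum_{\nu} \pi(\nu)\, (p - \bar{p})^{2} \tag{A19}$$

which gives

$$\sigma = |p_1 - p_2| \sqrt{\frac{f_1 f_2}{K}} \tag{A20}$$

Fixing $\bar{p}$ at .01, from Equation A16

$$p_2 = \frac{.01 - f_1 p_1}{1 - f_1} \tag{A21}$$

giving

$$\sigma = |.01 - p_1| \sqrt{\frac{f_1}{(1 - f_1)\, K}} \tag{A22}$$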


The worst case value (large $\sigma$) is produced by $f_1$ near 1.
Since $p_2$ is less than or equal to one, from A21

$$f_1 \le \frac{.99}{1 - p_1} \tag{A23}$$

Thus, the worst case value of $\sigma$ is for $p_1$ = 0 and $f_1$ = .99.
Inserting these values into A22 gives

$$\sigma \le .01 \sqrt{\frac{.99}{.01\, K}} \tag{A24}$$

Using again the requirement that the sample size should be sufficient
to distinguish a Type I error of .01 from that of .02,

$$\sigma = .005 \tag{A25}$$

yields

$$396 \le K \tag{A26}$$


Table A3 shows the probabilities of measuring certain p values
when 400 subjects are used and the worst case assumptions are made.
Observe that a value of p = 1.50% or less is obtained 89% of the time.

Unfortunately, the result K = 400 is rather a large number of
subjects to include in the data base. This large number has arisen
due to the fact that we have postulated a subgroup comprising only 1%
of the population. If we were to relax the specifications so that
only subgroups of 5% or more would be of concern, then $f_1$ can be
set to .95. Using again the worst case values of $p_1$ = 0.0 and
$p_2$ = 0.2,

$$.005 = \sigma \le .2 \sqrt{\frac{(.95)(.05)}{K}} \tag{A27}$$

or

$$76 \le K \tag{A28}$$

Similarly, if subgroups no smaller than 10% of the population are
considered

$$36 \le K \tag{A29}$$


                              TABLE A3

       Two Subgroups Assumed, With Type I Errors 0.0 and 1.0
             and Frequency of Occurrence .99 and .01.

            Table shows probability of occurrence for
           various p values for 400 subjects in sample.

                    ν        π(ν)        p

                   400      .01795      0
                   399      .07253      .25%
                   398      .14615      .50%
                   397      .19585      .75%
                   396      .19635     1.00%
                   395      .15708     1.25%
                   394      .10446     1.50%
                   393      .05939     1.75%
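
The two-subgroup worst case can be checked numerically. The sketch below is a hedged illustration: it assumes the error in the measured rate is σ = |p1 - p2| sqrt(f1 f2 / K), the binomial standard deviation of the subgroup mix scaled by the difference in error rates, and the function names are chosen here for convenience.

```python
import math


def subjects_required(f1, p1, p2, sigma):
    """K for which |p1 - p2| * sqrt(f1 * (1 - f1) / K) equals sigma."""
    return (p1 - p2) ** 2 * f1 * (1.0 - f1) / sigma ** 2


def prob_minority_at_most(k_subjects, f2, max_count):
    """Probability that at most max_count of K subjects are in the minority subgroup."""
    return sum(
        math.comb(k_subjects, j) * f2**j * (1.0 - f2) ** (k_subjects - j)
        for j in range(max_count + 1)
    )
```

For a 1% subgroup that is always rejected (f1 = .99, p1 = 0, p2 = 1) and σ = .005, `subjects_required` returns 396. With 400 subjects, `prob_minority_at_most(400, 0.01, 6)` is the chance of measuring p of 1.50% or less, about .89.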
A.2 NUMBER OF ATTEMPTS AND SUBJECTS FOR TYPE II ERROR TESTING

The purpose of this Section is to discuss the number of attempts
which must be made against an identity verification device to
ascertain its Type II error performance to certain confidence levels.


We first define the Type II error rate. Let $\alpha$ be an index
over the population to be enrolled in the system. An individual who
is enrolled will have an 'account' which will contain the personal
data against which he will be compared. When another individual, i,
makes a verification attempt against account $\alpha$, an opportunity
for a Type II error arises. The probability that individual i will be
accepted under account $\alpha$ will be denoted $p_{\alpha i}$. By
letting i run over all members of the population which might attempt
access, we could obtain the Type II error rate of account $\alpha$,


$$P_\alpha = \frac{1}{N_{TOT}} \sum_{i=1}^{N_{TOT}} p_{\alpha i} \tag{A30}$$

where $N_{TOT}$ is the size of the intruder population.

In Section A.1 we discussed for Type I errors whether a
specification on the error rate should be a rigid bound on all
accounts or an average over all accounts. We demonstrated that only
the latter made sense. Correspondingly, we here adopt a definition of
the identity system Type II performance as an average. The Type II
error rate is defined as

$$p = \frac{1}{N'_{TOT}} \sum_{\alpha=1}^{N'_{TOT}} P_\alpha \tag{A31}$$

where $N'_{TOT}$ is the size of the population of individuals who would
potentially be enrolled in the system.


Since the goal of Data CAT is to collect a general data base,
neither intruder nor enrollee population can be specified exactly. We
take both to be the entire American population. Thus,

$$p = \frac{1}{N_{TOT}^{2}} \sum_{\alpha=1}^{N_{TOT}} \sum_{\substack{\beta=1 \\ \beta \ne \alpha}}^{N_{TOT}} p_{\alpha\beta} \tag{A32}$$

where the two populations are considered to be the same and an
individual is eliminated from the Type II statistics against his own
account by deleting the $\alpha = \beta$ term.

Altogether, two random variables must be adequately sampled in
compiling the Data CAT data base. There should be enough access
attempts that the individual terms $p_{\alpha\beta}$ are accurately
estimated, and there should be sufficient account-intruder pairs that
the population is adequately sampled.

Despite the foregoing, to simplify the discussion we first assume
all accounts and intruders are equivalent. We have a single account
and a single intruder. We want to know how many samples, N, of the
intruder are required to test a system which performs with a Type II
error near a) .02 and b) .00001. The intruder is either accepted or
rejected so the binary statistics developed in Section A.1 can be
used.
Table A4 gives the confidence for a system with p = .02. For
example, if the data base contained 200 samples and four were falsely
accepted, then the Type II error would be 2%. Furthermore, by
examining the Table, one sees that the assertion that p = .02 ± .01
has a 67% confidence. Using this error interval as the most
reasonable choice, we can state that 800 samples are required for 90%
confidence and 1000 for 95%. Table A5 provides the same information
as Table A4 for a system with p = .001%. Here we see that for the
preferred choice of p = .001% ± .0005%, 1.2 x 10^6 samples are
sufficient for 90% confidence, but even 1.5 x 10^6 samples are
insufficient to achieve a confidence of 95%. Comparing to Table A4 we
estimate a requirement of 1.8 x 10^6 samples.


We now extend the argument to consider the fact that different
account-intruder pairs will have different values of p. We must have
a sufficient number of pairs to sample the population adequately.
This question was also considered in Section A.1 under the
"undisclosed subgroup problem." There we showed that the greatest
danger of biased sampling occurred for two subgroups, one with
$p_1$ = 0 comprising 98% of the population and one with $p_2$ = 1.00
comprising 2%. This produces a p equal to .02 but a large sample of
individuals is required to reduce the fluctuations in the number of
members of the poorly performing subgroup included in the sample.
From Equations A16 and A22,

$$\sigma = .02 \sqrt{\frac{.98}{(1 - .98)\, K}} \le .01 \tag{A33}$$
                              TABLE A4

                  Confidence Levels for a System
                     With Type II Near 2%

                         Interval p = .02 ±

          N Samples      .005       .01        .02

             100         27.0       50.5       77.0
             200         38.0       67.0       90.7
             500         57.1       87.3       99.0
             800         68.3       91.1       99.9
            1000         73.7       96.5       99.95
            1500         82.9       98.9       99.96


                              TABLE A5

                  Confidence Levels for a System
                    With Type II Near .001%

                         Interval p = 10^-5 ±

          N Samples      .25 x 10^-5    .5 x 10^-5    10^-5

          10^5              18.2           35.2        59.4
          10^6              56.5           86.8        98.9
          1.2 x 10^6        60.8           89.9        99.4
          1.5 x 10^6        66.2           93.2        99.8
Thus, the number of pairs required, K, is

$$196 \le K \tag{A34}$$

For the device performing at p = .001%, 399996 pairs would be
required.

Notice that the population of intruders and accounts cannot
overlap. However, a population of enrolled individuals must be
collected for Type I testing anyway. The intruders could be drawn
from a subset of this group. That is, suppose K' individuals are
collected for Type I error testing. Let their accounts be numbered
1, 2, ..., K'. Use Account 1 and run individuals 2, 3, ..., K' as
intruders. Use Account 2 and run the K' - 2 remaining individuals as
intruders. Notice that Individual 2 is run against Account 1 but
not conversely. Choosing both combinations would not constitute an
independent sample from the universe of all possible pairs even though
$p_{\alpha\beta}$ is not necessarily equal to $p_{\beta\alpha}$.
Proceeding in this fashion, one obtains (K'-1)K'/2 pairs. Thus, with
K' individuals in the data base, the number of tests which may be
performed, K, is

$$ K \approx \frac{K'^2}{2} \tag{A35} $$
For a Type II error of 2%, using Equations A34 and A35,
approximately 20 individuals are required. For an error of .001%,
approximately 895 individuals are required.
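
The pair and head counts above can be checked with a short calculation.
The sketch below is illustrative only. The function names are
hypothetical, the tolerance on p is assumed to be .01 for p = .02 (the
bound in the inequality preceding Equation A34) and .0005% for
p = .001% (the recommended interval), and the approximation of
Equation A35 is used rather than the exact count (K'-1)K'/2.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil, isqrt


def pairs_required(p, tolerance):
    """Smallest K with sqrt(p * (1 - p) / K) <= tolerance."""
    return ceil(p * (1 - p) / tolerance**2)


def individuals_required(pairs):
    """Smallest K' with K'**2 / 2 >= pairs (approximation of Equation A35)."""
    k_prime = isqrt(2 * pairs)
    return k_prime if k_prime * k_prime >= 2 * pairs else k_prime + 1


def intruder_pairs(k_prime):
    """(account, intruder) pairs with account < intruder, each used once."""
    return list(combinations(range(1, k_prime + 1), 2))


# Exact rational arithmetic avoids floating-point rounding at the boundary.
k_2pct = pairs_required(Fraction(2, 100), Fraction(1, 100))
k_small = pairs_required(Fraction(1, 100_000), Fraction(5, 1_000_000))
print(k_2pct, individuals_required(k_2pct))
print(individuals_required(k_small))
print(len(intruder_pairs(20)))  # exact number of one-way pairs for K' = 20
```

For K' = 20 the exact count (K'-1)K'/2 is 190 pairs, slightly below the
196 called for, so the K'^2/2 approximation is mildly optimistic when K'
is small.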


Just as the total number of pairs is a quadratic function of the
number of individuals, the total number of samples required, N, is a
quadratic function of the number of samples per individual, n. Thus,


N 


(A36) 


2 

For example, we know that 399996 ($= K'^2/2$) pairs are required for
p = .001%. Also, $1.2 \times 10^6$ (= N) tests are required to establish
a 90% confidence on the recommended interval ± .0005%. Thus, inserting
into Equation A36,


n  =  3 


( A37 ) 


Table A6 summarizes the requirements on individuals and samples in
the data base. We observe that a requirement for 400 individuals to
achieve a Type I error rate of 1% is more demanding than the
equivalent twenty individuals for a 2% Type II.

TABLE A6

                  Number of Persons     Samples/Person
Type II Error            K'              90%      95%

    2%                   20               4        5
  .001%                 895               3        5
A.3 NUMBER OF SAMPLES FOR ENROLLMENT

Currently  available  identity  verification  systems  measure  a 
personal  attribute  of  a  test  subject  and  compare  the  measured 
attribute  with  a  previously  stored  reference  file  for  the  subject. 
This  operation,  the  verification  process,  thus  requires  a  reference 
file  for  each  user.  The  reference  file  is  created  when  the  user  is 
enrolled  in  the  system,  but  may  be  updated  with  subsequent 
measurements  from  verification  attempts. 

The  purpose  of  Data  CAT  is  to  design  a  data  base  of  speech, 
fingerprint, and handwriting attributes which will permit testing of
potential  identity  verification  devices.  The  data  base  must  contain 
measurements  for  both  enrollment  and  verification.  Typically,  at 
enrollment  several  repeated  measurements  are  performed.  For  example, 
a  subject  in  the  handwriting  system  would  sign  his  name  several  times 
in  order  to  establish  a  representative  pattern.  The  Data  CAT  data 
base should be general enough to accommodate a wide class of
verification  systems,  and,  therefore,  the  enrollment  portion  must 
contain  more  than  one  measurement  of  the  attribute.  Each  measurement 
will  be  called  a  sample.  This  Section  will  discuss  the  number  of 
samples  which  should  be  collected  for  the  enrollment  portion  of  the 
data  base. 
All identity verification devices currently under consideration
by the Air Force work in a fashion which is easily described in the
language of linear pattern recognition. When the personal attribute
is collected, certain key features, believed to be particularly
distinguishable between individuals, are extracted. The complex
attribute is thus reduced to a simpler set of features. Let us
suppose that A measurements are made, the first measurement having
value $x_1$, etc. The measurements, assembled as a vector, x,

$$ x = \langle x_1, x_2, \ldots, x_A \rangle, \tag{A38} $$

comprise the feature vector. In linear pattern recognition the vector
x is treated as a point in an A-dimensional linear space. When the
attribute for the subject is measured a second time, due to
measurement noise, statistical fluctuation, or actual change in value,
the feature vector, x, will be different. However, if the personal
attribute is useful for identification and the features are well
constructed, then all the vectors for a particular subject should be
relatively close together and relatively far from vectors belonging to
a different subject. A metric is obviously needed to formalize the
notion of distance. Figure A2 shows a two-dimensional space with
feature vectors for two subjects.

At  enrollment  a  reference  file  for  a  subject  is  created.  This 
reference  file  is  a  means  of  specifying  that  region  or  regions  in 
feature  space  which  are  likely  to  contain  vectors  for  the  subject.  At 
verification a newly acquired feature vector is tested to see whether
it lies in an acceptable region for the subject, and he is accepted or
rejected accordingly.

Figure A2  A two-dimensional feature space containing measurement
vectors for two subjects, one represented by dots, the other by
crosses.

A  number  of  methods  for  specifying  the  so-called  decision 
boundaries  of  regions  are  in  common  use.  In  general,  there  are  simple 
methods  which  involve  few  parameters  and  represent  the  regions  by 
relatively  simple  shapes  and  methods  employing  numerous  parameters, 
thereby  capable  of  representing  more  complex  shapes.  Each  parameter 
which  is  used  to  specify  a  region  must  be  established  before  a 
decision  strategy  can  be  implemented.  In  an  identity  verification 
system  the  region  parameters  are  estimated  at  enrollment  by  using  the 
repeated  measurements  of  the  attribute  under  consideration.  The 
accuracy  with  which  a  region  can  be  specified  will  depend  both  on  the 
number  of  parameters  needed  and  on  the  dimensionality  of  the  space. 
Moreover,  the  number  of  samples  of  an  attribute  which  are  available  to 
estimate  the  parameters  directly  affects  the  accuracy  of  region 
representation. 

For  Data  CAT  we  need  to  determine  the  number  of  samples  of  an 
attribute  which  might  be  required  by  a  future  identification  device. 
The  answer  to  this  question  depends  on  the  dimensionality  of  the 
feature space and on the complexity of the region, both of which are
impossible to describe without previously specifying the device. We
can,  of  course,  make  estimates  of  the  maximum  dimensionality  permitted 
in  the  data  for  the  respective  attributes.  Furthermore,  we  could 
postulate  commonly  used  region  shapes  (or  decision  strategies).  We 
postpone this ultimate question for the present and consider a few
well known decision strategies in order to elucidate the interplay
between complexity of a strategy and number of samples required.

The simplest strategies, or 'logics' as they are sometimes called
to emphasize their decision making role, all assume a single simply
connected region. A straightforward logic is to assume that the
regions for all individuals may be represented by a simple geometric
figure such as a hypersphere, hypercube, or hyper-rectangle. The size
and shape of the geometric figure are fixed; only its location need be
ascertained by samples of feature vectors for the subject. Figure A3
shows such a decision strategy for three individuals.

How many samples are required to center the decision box?

Suppose for the present that A = 1 and n measurements are made:
$x^{(1)}, x^{(2)}, \ldots, x^{(n)}$. Then the average value of x, where
the box should be located, is:

$$ \bar{x} = \frac{1}{n} \sum_{\alpha} x^{(\alpha)} \tag{A39} $$

The best estimate for the error in each measurement is

$$ \Delta x = \sqrt{\frac{\sum_{\alpha} \left( x^{(\alpha)} - \bar{x} \right)^2}{n - 1}} \tag{A40} $$

and the best estimate of the deviation of $\bar{x}$ from the true mean is

$$ \Delta\bar{x} = \frac{\Delta x}{\sqrt{n}} \tag{A41} $$

where the estimate of the mean given by Equation A39 will lie within
$\Delta\bar{x}$ of the true mean with probability .682. The observed
$\Delta\bar{x}$ will be required to be smaller than some bound B
(presumably related to the characteristic size of the box) in order to
make the estimated mean, $\bar{x}$, close to the true mean:

$$ \Delta\bar{x} < B \tag{A42} $$

Figure A3  Simple decision logic utilizing fixed geometric shapes. A
measurement vector lying inside box A is accepted as Subject A, etc.
(Axes: Measurement 1 and Measurement 2.)
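
Equations A39 through A42 translate directly into a check on a set of
enrollment samples. The following is a minimal sketch for the
one-dimensional case. The function name `enough_measurements` and the
sample values are hypothetical, and the bound B must be supplied by
whoever applies the test.

```python
from math import sqrt
from statistics import mean, stdev


def enough_measurements(samples, bound):
    """Return (x_bar, delta_x, delta_x_bar, ok) for a list of scalar samples.

    x_bar        sample mean (Equation A39)
    delta_x      sample standard deviation, n - 1 in the denominator (A40)
    delta_x_bar  estimated deviation of the mean, delta_x / sqrt(n) (A41)
    ok           True when delta_x_bar < bound (A42)
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed to estimate the spread")
    x_bar = mean(samples)
    delta_x = stdev(samples)
    delta_x_bar = delta_x / sqrt(n)
    return x_bar, delta_x, delta_x_bar, delta_x_bar < bound


x_bar, delta_x, delta_x_bar, ok = enough_measurements([1.0, 2.0, 3.0, 4.0], bound=1.0)
print(x_bar, delta_x, delta_x_bar, ok)
```

If the test fails, more samples are collected and the test is repeated,
since the estimated deviation of the mean shrinks as the square root of n.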

Equations A40, A41, and A42 give an operational test to determine
whether enough measurements have been made. Now suppose x is
two-dimensional. Then the mean and standard deviation for each
component are calculated as above. But to guarantee that both
components have an estimated mean close to the true mean will
require more measurements. Suppose we require that $\bar{x}_1$ and
$\bar{x}_2$ lie within $B_1$ and $B_2$ of their true means with joint
probability .682. Then we must require that $\bar{x}_1$ and $\bar{x}_2$
each lie within the bounds with probability .826. In general, each
component must satisfy its bounds with a probability $(.682)^{1/A}$ to
make the joint probability .682. For example, if we make sufficient
measurements that $\Delta\bar{x}/B \le .5$, then $\bar{x}$ will lie
within B of the true value with probability .954, since

$$ .954 = 2 \left( \frac{1}{\sqrt{2\pi}} \int_0^2 \exp\left( -\frac{y^2}{2} \right) dy \right) = \operatorname{erf}\left( \frac{2}{\sqrt{2}} \right), \tag{A43} $$

or in general,

$$ (.682)^{1/A} = \operatorname{erf}\left( \frac{B}{\sqrt{2} \, \Delta\bar{x}} \right). \tag{A44} $$

Substituting Equation A41 gives

$$ n = 2 \left( \frac{\Delta x}{B} \right)^2 \left[ \operatorname{erf}^{-1}\left( (.682)^{1/A} \right) \right]^2. \tag{A45} $$

The value of n in units of $(\Delta x/B)^2$ for some representative
choices of A is given in Table A7.
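
The entries of Table A7 can be approximated numerically. The sketch
below assumes that each of the A components must meet its bound with
probability (.682)^(1/A) and that the estimated mean is normally
distributed, so that n in units of (Δx/B)^2 is the square of the
corresponding two-sided normal quantile. The function name is
hypothetical, and the results agree with the tabulated values only to
within about one percent.

```python
from statistics import NormalDist


def samples_in_units(num_parameters, joint_probability=0.682):
    """n / (delta_x / B)**2 for a logic with the given number of parameters."""
    per_component = joint_probability ** (1.0 / num_parameters)
    # Two-sided interval: P(|z| < z0) = per_component, so Phi(z0) = (1 + P) / 2.
    z0 = NormalDist().inv_cdf((1.0 + per_component) / 2.0)
    return z0 * z0


for a in (1, 10, 100, 1000):
    print(a, round(samples_in_units(a), 2))
```

The slow growth with the number of parameters reflects the fact that the
normal quantile rises only roughly as the square root of the logarithm of
the number of components.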

Another method for specifying a region is to estimate not only
the location of a simple geometric figure, but also its size. Figure
A4 shows such a use of ellipses. For each ellipse the location and
width of the ellipse must be determined by sampling, for a total of
2A parameters per class. Using Equation A45 with A replaced by 2A
produces Table A8.

Another common method of specifying a logic is to permit the
geometric figures to have arbitrary size and orientation in addition
to location. In this case $A^2$ parameters are used for size and
orientation and A for location. Using Equation A45 with A
replaced by $A^2 + A$ produces Table A9.

Another condition which may be placed on the number of samples
required is that the estimated parameters be linearly independent.
This can occur only if the number of sampled numbers is at least as
great as the number of estimated numbers. With n samples of dimension
A, nA numbers are available. Thus, at least one sample is required for
an A parameter logic, two are required for a 2A logic, and A + 1 are
required for an $A^2 + A$ logic.
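
The parameter counts for the three kinds of logic, and the linear
independence condition on the number of samples, can be tabulated as
follows. This is an illustrative sketch; the dictionary labels are
descriptive names chosen here, not terms from the text.

```python
def parameter_counts(dimension):
    """Parameters per class for the three logics discussed above."""
    a = dimension
    return {
        "location only": a,
        "location and size": 2 * a,
        "location, size and orientation": a * a + a,
    }


def min_samples_for_independence(num_parameters, dimension):
    """Each sample supplies `dimension` numbers; need n * dimension >= parameters."""
    return -(-num_parameters // dimension)  # integer ceiling division


for label, count in parameter_counts(10).items():
    print(label, count, min_samples_for_independence(count, 10))
```

For a feature space of dimension 10 this gives 1, 2 and 11 samples,
matching the one, two and A + 1 samples stated above.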

In conclusion, we observe that the number of samples required to
establish the parameters of a region is dependent on the type of logic
employed and on the dimensionality of the feature space. However, for
logics of the first two types considered, in which the number of
parameters to be estimated is a linear function of A, even for large
A the number of measurements is not that excessive.

TABLE A7

Logic With A Parameters Per Class

A                               1      10      100     1000
n (in units of (Δx/B)^2)        1      4.28    8.32    12.7

TABLE A8

Logic With 2A Parameters Per Class

A                               1      10      100     1000
n (in units of (Δx/B)^2)        1      1.84    8.71    17.1

TABLE A9

Logic With A^2 + A Parameters Per Class

A                               1      10      100
n                               1.84   8.71    17.1

We  now  turn  to  a  consideration  of  the  maximum  dimensionality 
available  in  the  attributes  considered.  In  the  case  of  speaker 
identification,  we  shall  presume  that  an  utterance  of  two  seconds  is 
used consisting of approximately twenty different phonemes [3]. Each
phoneme can be characterized by a few numbers such as voiced/unvoiced,
pitch, and formant position, bandwidth, and relative amplitude.
Altogether, some twenty numbers are perhaps sufficient, leading to an
estimate of A = 400 for a two second utterance. It is interesting to
compare this to the number of bits necessary to encode the utterance.
Using either a channel or LPC vocoder, approximately 2000 bits/second
are required for good quality speech [4].

Whereas voices are adequately decomposed into formants, no such
set of features has ever been devised for fingerprints. Much of the
information content of a print resides in the minutiae, however.
Assuming four numbers per minutia (two for location, one for
direction, and one for type) and 100 minutiae per print yields an
estimate of A = 4000 for all ten fingers. Encoding of fingerprints
requires some 60,000 bits per digit [5].

In the case of signatures, even less is known, and neither
estimates of the number of features nor the number of bits for
encoding are available in the literature. Assuming an average
signature to be perhaps five inches in length when integrated along
its arc and assuming a resolution of .01 inches, then the spatial
information might require some 1000 numbers. At 10 bits per number,
10000 bits per signature would be required. Finally, doubling this
number to allow for a pressure variable yields 20000 bits. This
number is an upper bound since many portions of a signature consist of
line segments of low or zero curvature.

Considering the current speaker verification system built by
Texas Instruments [6] to be prototypical, we can compare the number of
dimensions utilized and the number of enrollment samples required. In
each utterance four reference points with 100 associated numbers are
evaluated, giving a dimensionality of 400. Since these reference
points concern only vowels, not all phonemes are exploited. Thus the
agreement between the theoretical and actual dimensionality is largely
coincidental. At enrollment time, each word is spoken four times.

We consider, likewise, the fingerprint verification device built
by CALSPAN Corporation [7] to be typical. Unfortunately, the
operation of the device is not described in open literature. Although
print matching is based on minutiae (position in two coordinates and
orientation), the number of minutiae used is not stated. It appears
to be variable depending on the number located within the print, with
three being a minimum. Thus, the dimensionality is at least nine.
The CALSPAN device requires ten enrollment samples.
Even less information is available about the operation of the
prototypical signature verification system built by Veripen [8].
Veripen uses six signatures for enrollment.

Table A10 summarizes this information. The conclusion which can be
reached from Table A10 is that the number of enrollment samples is
consistent with the dimensionality presuming a simple logic is
employed. This is known to be the case for the Texas Instruments
device which uses the following simple region specification. The
logic employed is to require that the measured vector x lies within a
distance t of the reference vector r. That is,

$$ \sum_{\alpha} (x_\alpha - r_\alpha)^2 < t \tag{A46} $$

The variable t is allowed to be a function of the individual. Equation
A46 thus defines a circle of variable radius in feature space and is a
very simple example of the second type of logic which we discussed.
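
The acceptance rule of Equation A46 can be sketched in a few lines. This
is only an illustration of the inequality as written, not the Texas
Instruments implementation; the function name and argument layout are
assumptions.

```python
def accept(x, r, t):
    """Accept when the summed squared difference between x and r is below t.

    x  measured feature vector
    r  reference vector stored at enrollment
    t  per-individual threshold (compared against the squared distance)
    """
    if len(x) != len(r):
        raise ValueError("x and r must have the same dimension")
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r)) < t


print(accept([1.0, 2.0], [1.0, 2.5], t=0.5))
print(accept([0.0, 0.0], [1.0, 1.0], t=2.0))
```

Because t is set per individual, a subject whose enrollment samples
scatter widely can be given a larger acceptance region than one whose
samples cluster tightly.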

Based on Tables A7, A8, A9, and A10, it would appear that ten
samples taken at enrollment is a reasonable number. A rather
compelling argument for using such a small number is the observation
that no practical access control device can require too many
enrollment samples. If this were the case, the device would be
unacceptable to both users and the agencies deploying it. As a
conservative measure to guard against possibly unusable data, we
recommend that the minimum number of enrollment samples be doubled to
twenty.
A.4 NUMBER OF SESSIONS FOR EACH SUBJECT

As we have shown in Section 1.0, N samples are required for Type
I error testing and K subjects must be included. Thus, at least
N' = N/K samples per subject are required in the data base. This note
is concerned with the question of how many sessions should be used to
collect the N' samples.

As  a  basic  premise  we  assume  that  as  the  number  of  sessions 
increases,  the  cost  of  collection  will  go  up.  This  is  reasonable 
since  in  any  data  collection  there  are  the  overhead  expenses  of  set-up 
time, travel time, subject coordination, and general organization. In
fact,  it  is  usually  the  case  that  the  time  devoted  to  overhead  items 
dominates  the  total  time  allocated.  Therefore,  we  should  minimize  the 
number  of  separate  sessions. 

It  will  not  ordinarily  be  possible,  however,  to  collect  all  the 
required  data  in  a  single  session  because  the  physiological  attributes 
being  measured  are  subject  to  long  term  variability  over  and  above  the 
short term variability which would appear at a single session. Let x
be the measurement vector and let $H_i(x)$ be the distribution of x
measured for the subjects at the ith data collection session. The
long term distribution, F(x), might be found by averaging the single
session distributions.
The distribution F is normalized if each $H_i$ is normalized. Figure A5
shows how a set of different $H_i$ can build an F. The vector x is
shown as a scalar for ease of presentation.

We now postulate that the session-to-session variability is due
to some hidden parameter y. For example, suppose variability in
fingerprint measurements is caused by variability in skin moisture.
Then y would measure moisture content. The postulate implies that for
each y value (y is a vector), there is a unique H(x,y). As different
collection sessions are conducted, y will vary in time according to an
unmeasured law and will result in different H distributions. That is,
if $y(t_i)$ is the value of y at the time of the ith observation, $t_i$,
then

$$ H(x, y(t_i)) = H_i(x) \tag{A48} $$

Let G(y) be the temporal density function of y. That is, G(y)dy is the
probability that a random sample of the hidden parameter will produce
a value between y and y + dy. Then

$$ F(x) = \int_{-\infty}^{+\infty} G(y) \, H(x,y) \, dy \tag{A49} $$

A hidden variable y may always be postulated. One may take y as
time itself, for example. However, the existence of a distribution
G(y) which can be normalized is an assumption which we make. This
assumption is equivalent to stating that the long term variability in
the parameter x is bounded. Since time is bounded in the access
control situation of interest to us here, the function G(y) must
always exist. The trivial case is when no hidden parameter other than
time exists and G may be taken as the reciprocal of the time over
which the data collection will occur, $t_N - t_1$. The interesting case
is when y returns to the same value, making G a non-constant function
of y.

In Section 1.0 we considered the problem of determining the
number of subjects the data base should include. We used the notion
of an undisclosed subgroup and, as an extreme example, considered that
for one subgroup the Type I error p was the highest possible value,
1.0. We then argued that the Type I error quoted for a verification
system should be the average over the population. A subgroup with a
value of p = 1.0 could comprise a small fraction of the population
(namely 1%) and still permit the access device to meet specifications
if the rest of the population had p = 0. If a data base were to
contain  few  subjects,  the  probability  of  measuring  the  correct  value 
of  p  for  the  population  would  be  small  since  the  sample  would 
frequently  contain  too  few  or  too  many  members  of  the  poorly 
performing  subgroup. 

The results of Section 1.0 can be used to determine the number of
data collection sessions required. In predicting the time averaged
performance of an access control system, the worst case would arise
when the hidden parameter y took on only two discrete values. When
$y = y_1$, the subject has $p = p_1 = 0.0$. This case occurs 99% of the
time, so $G(y_1) = .99$. However, when $y = y_2$, the system performance
degenerates to a value of $p = p_2 = 1.0$, with $G(y_2) = .01$. The time
averaged Type I performance is .01 and still meets specifications.
The hidden parameter y which governs the temporal variability of x is
now analogous to the hidden subgroup which governs variability over
subjects. We can thus state that 396 different data collection
sessions are required to assure that any hidden parameter y does not
possess statistics which will make the predicted Type I error
erroneous. These sessions must, of course, be collected at times
separated by an interval such that y will have a high probability of
changing.
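
The worst-case arithmetic can be sketched as follows. The sketch treats
the hidden parameter as taking a finite set of values and reuses the
binomial-spread argument applied to subjects. The tolerance of .005 on
the error rate is an assumption made here for illustration, since the
tolerance is not restated in this section, and the function names are
hypothetical.

```python
from fractions import Fraction
from math import ceil


def time_averaged_error(states):
    """Sum of G(y) * p(y) over discrete hidden-parameter values.

    `states` is a list of (probability_of_state, error_rate_in_state) pairs.
    """
    if sum(g for g, _ in states) != 1:
        raise ValueError("state probabilities must sum to 1")
    return sum(g * p for g, p in states)


def sessions_required(p, tolerance):
    """Smallest count with sqrt(p * (1 - p) / count) <= tolerance."""
    return ceil(p * (1 - p) / tolerance**2)


# Two-valued worst case: error-free 99% of the time, always wrong 1% of the time.
worst_case = [(Fraction(99, 100), Fraction(0)), (Fraction(1, 100), Fraction(1))]
p_avg = time_averaged_error(worst_case)
sessions = sessions_required(p_avg, Fraction(5, 1000))
print(p_avg, sessions)
```

Exact fractions are used so that the ceiling is not disturbed by
floating-point rounding at the boundary.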

The number of sessions is appallingly large. However, by making
some reasonable assumptions, the number can be reduced. If we assume
the temporal variability in the measured attribute, x, is the same for
all subjects (only one G(y)) but is uncorrelated in time between
different subjects, then the result over many sessions can be inferred
from the results over many subjects. For example, suppose we have 400
subjects who are enrolled at one session and tested at a later session
with a time interval long compared to the time for variation in x. If
there were a hidden parameter with the .01 probability of occurrence
and p = 1.0, then the most probable occurrence is for 1% of the
subjects to be rejected. As we show in Table A3, the 400 subjects
permit determining of p of .01 with almost 90% confidence. Thus, if one
satisfies  the  requirement  on  number  of  subjects,  he  will  also  satisfy 
the  requirement  on  sessions  if  two  sessions  (counting  enrollment)  are 
PAR SPEECH PROCESSING (PSP) SYSTEM

The PAR Speech Processing System is a flexible and easily
expandable system within which a variety of speech processing tasks
have been implemented. Data is stored in files in established
formats; processing is carried out by independent tasks operating on
these files, each implementing a single function.

There  are  four  basic  types  of  files:  waveform  files,  containing 
digital  speech  data;  encoded  data  files,  containing  linear  prediction 
encoded  speech  data;  phoneme  library  files,  containing  the  phonemes 
used  in  construction;  and  covariance  files,  containing  covariance 
matrices  for  phonemes  used  in  phoneme  recognition.  Figure  B1  shows 
the  different  file  types  and  lists  the  programs  and  functions  as  they 
are  related  to  the  files.  The  following  is  a  short  description  of 
each  program. 

Record: This task digitizes an analog speech signal using the
LPA11-K(*). The sampling rate is 12.8 kHz with 12 bit (+/- 2048)
accuracy. Data is stored in waveform files, as 2 byte unformatted
integers, 512 bytes/block. The start and stop of digitization is
under operator control and the duration is limited only by the largest
contiguous space on disk.

(*) DEC PDP 11 series laboratory peripheral

WAVEFORMS

    Record:      Digitize speech signal
    Playback:    Convert digital waveform to analog signal
    Encode:      Encode speech into cross-sectional areas
    Edit:        Extract portions of a waveform file
    Display:     Display raw or processed waveform
    Dump:        List out data values
    Scale:       Scale waveform to 12 bits

ENCODED DATA

    Decode:      Decode cross-sectional areas into digital waveform
    Change:      Modify voicing, pitch parameters
    Display:     Display time history of cross-sectional areas
    Dump:        List out data values
    Construct:   Construct an encoded utterance

PHONEMES

    Enter:       Enter a phoneme into a library
    Delete:      Delete a phoneme from a library
    Dump:        List the phonemes in a library
    Average:     Average many frames and enter into a library

COVARIANCE MATRICES

    Covariance:  Calculate a covariance matrix for a file
    Invert:      Invert the covariance matrices in a file
    Classify:    Preliminary classification
    Dump:        List entries in a covariance file
    Delete:      Delete entries in a covariance file

Figure B1  PAR SPEECH PROCESSING (PSP) SYSTEM


Playback: This task plays a digital waveform back out through
the LPA11-K at 12.8 kHz. Start of D/A conversion is under operator
control. The file can be auditioned repeatedly or a new file can be
auditioned.


Encode: This task encodes a digital speech signal into linear
prediction coefficients. The output file contains a frame label, if
known, frame voicing, pitch period (if voiced), gain factor, fifteen
linear prediction coefficients, fifteen reflection coefficients, and
fifteen cross-sectional areas. These are 4 bytes, unformatted.

The encoding employs the auto-correlation method, as explained in
Section 3.3.2. The computation is carried out using Robinson's
recursion [1]. The reflection coefficients which are intermediate
results of this calculation are used to calculate the cross-sectional
areas using

$$ \frac{A_m}{A_{m-1}} = \frac{1 + k_m}{1 - k_m}, \quad m = 1, 2, 3, \ldots, \qquad A_M = 1 $$

where $A_m$ are the cross-sectional areas and $k_m$ are the reflection
coefficients.
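
The conversion from reflection coefficients to cross-sectional areas can
be sketched as a backward recursion. The sketch assumes the area ratio
A_m / A_(m-1) = (1 + k_m) / (1 - k_m) with the normalization A_M = 1; which
end of the tube carries the normalization, and whether the normalizing
area is counted among the stored areas, are assumptions made here. The
function name is hypothetical.

```python
def areas_from_reflection(k):
    """Cross-sectional areas A_0..A_M from reflection coefficients k_1..k_M.

    Assumes A_m / A_(m-1) = (1 + k_m) / (1 - k_m) with A_M = 1, so the
    recursion runs backwards: A_(m-1) = A_m * (1 - k_m) / (1 + k_m).
    """
    if any(abs(km) >= 1.0 for km in k):
        raise ValueError("reflection coefficients must satisfy |k| < 1")
    m_max = len(k)
    areas = [0.0] * (m_max + 1)
    areas[m_max] = 1.0
    for m in range(m_max, 0, -1):
        km = k[m - 1]  # k is stored 0-based: k[m - 1] holds k_m
        areas[m - 1] = areas[m] * (1.0 - km) / (1.0 + km)
    return areas


print(areas_from_reflection([0.0, 0.0]))
print(areas_from_reflection([0.5]))
```

A zero reflection coefficient leaves the area unchanged across a
junction, which is the expected behavior for a uniform tube.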
Voicing is detected using a cyclic auto-correlation, which is
calculated by taking the inverse Fourier transform of the power
spectrum. This function, $r_c(n)$, is searched for its maximum between
n = 2 and n = 256. If $r_c(n_p)/r_c(1) > 0.35$, where $n_p$ is the
location of the peak value, then the frame is called voiced, with a
pitch period of $P = n_p/f_s$, where $f_s$ is the sampling frequency;
otherwise, the frame is called unvoiced with P = 0.
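
The voicing decision can be sketched in plain Python. The sketch computes
the cyclic autocorrelation directly in the time domain, which gives the
same values as the inverse Fourier transform of the power spectrum. It
assumes the indices in the text are 1-based, so that the normalizing term
is the zero-lag value and the search covers lags 1 through 255 in 0-based
terms. The function names, the test frame, and the sampling frequency
used in the usage lines are illustrative assumptions.

```python
import math


def cyclic_autocorrelation(frame):
    """Circular autocorrelation: rc[lag] = sum_k frame[k] * frame[(k + lag) % N]."""
    n = len(frame)
    return [sum(frame[k] * frame[(k + lag) % n] for k in range(n))
            for lag in range(n)]


def detect_voicing(frame, fs, threshold=0.35, min_lag=1, max_lag=255):
    """Return (voiced, pitch_period_in_seconds) for one frame of samples."""
    rc = cyclic_autocorrelation(frame)
    if rc[0] <= 0.0:  # silent frame: nothing to normalize by
        return False, 0.0
    hi = min(max_lag, len(frame) - 1)
    peak_lag = max(range(min_lag, hi + 1), key=lambda lag: rc[lag])
    if rc[peak_lag] / rc[0] > threshold:
        return True, peak_lag / fs
    return False, 0.0


# A 512-sample frame holding four periods of a sinusoid (period 128 samples).
frame = [math.sin(2.0 * math.pi * k / 128.0) for k in range(512)]
voiced, period = detect_voicing(frame, fs=12800.0)
print(voiced, period)
```

A single impulse has no correlation at any nonzero lag and is therefore
classified as unvoiced by the same routine.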

Edit:  This  task  allows  a  section  of  a  waveform  file  to  be 
extracted  and  placed  in  another  file.  This  is  useful  in  eliminating 
the long silences before and after utterances, and in selecting short
portions  of  long  utterances  for  processing. 

Display: Waveform files can be displayed on a Tektronix 4014
storage tube display terminal in two formats. The raw data can be
displayed, with only the frame boundaries and frame numbers marked.
If the file has a corresponding encoded data file, then the frame
label (if known), voicing, pitch period, and frame number are
displayed. Both displays are 10 frames/line, 4 lines/page (see Figures
B2 and B3).

Dump:  This  task  simply  prints  out  the  actual  data  values 
contained in a waveform file (Figure B4).

Scale: This task scales data from greater than 12 bits to 12
bits. It does not scale data up from less than 12 bits.
Decode: This task is complementary to the Encode task in that it
creates  a  digital  waveform  from  an  encoded  data  file.  The  linear 
prediction  coefficients  are  used  to  design  a  digital  filter  whose 
excitation  is  a  pulse  train  for  voiced  speech  or  Gaussian  distributed 
random  noise  for  unvoiced  speech. 
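
The synthesis step described for Decode can be sketched as an all-pole
filter driven by one of two excitations. This is not the PSP
implementation: the sign convention of the predictor coefficients, the
placement of the gain, the unit pulse amplitude, and the fixed random
seed are all assumptions made for illustration, as are the function
names.

```python
import random


def excitation(num_samples, pitch_period_samples, rng=None):
    """Unit pulse train when voiced (period > 0), else Gaussian noise."""
    if pitch_period_samples > 0:
        return [1.0 if n % pitch_period_samples == 0 else 0.0
                for n in range(num_samples)]
    rng = rng or random.Random(0)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def synthesize_frame(lpc, gain, pitch_period_samples, num_samples):
    """All-pole synthesis: s[n] = gain * e[n] + sum_i lpc[i - 1] * s[n - i]."""
    e = excitation(num_samples, pitch_period_samples)
    out = []
    for n in range(num_samples):
        acc = gain * e[n]
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


# One-pole example: each pulse decays by half per sample.
print(synthesize_frame([0.5], gain=1.0, pitch_period_samples=4, num_samples=6))
```

A pitch period of zero selects the noise excitation, mirroring the
convention that unvoiced frames carry P = 0.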

Change: This task allows the user to modify the voicing decision
and/or  pitch  period  for  any  frames  in  an  encoded  data  file. 

Display:  This  task  displays  the  time  history  of  any  one  of  the 
fifteen  cross-sectional  areas  as  a  bar  graph.  There  are  ten  frames 
per line, 4 lines per page, and the display is labeled with the frame
label (if known), the voicing, pitch period, and frame number (Figure
B5).

Dump:  This  task  lists  out  the  data  contained  in  an  encoded  data 
file,  frame  by  frame  (Figure  B6). 

Construct:  This  task  constructs  an  encoded  data  file  according 
to a string of phonemes specified by the user. The data values used
to construct the string are obtained from the appropriate phoneme name,
a relative factor for the pitch and gain, and the duration. Control
of the pitch and gain and duration gives the user control of the
prosody of the utterance.
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The  construction  program  reads  the  phonemes  out  of  the  library  by 
pairs.  Starting  with  the  first  and  second*,  it  first  duplicates  tnem 
for  the  duration  specified  in  a  buffer.  Then  transitions  are 
calculated  for  the  cross-sectional  erees,  gain  and  pitch.  Then  new 
values  are  calculated  for  the  linear  prediction  and  reflection 
coefficients  arc  calculated  as  the  first  phoneme  is  written  to  the 
output  encoded  data  file,  frame  by  frame.  The  third  phoneme  is  then 
read  in,  duplicated,  and  transitions  between  it  end  the  second  phoneme 
are  calculated.  The  second  phoneme  is  output,  and  the  procedure 
repeats  until  the  last  phoneme  is  output.  Such  a  constructed 
utterance  con  than  be  decoded  and  auditioned.  This  is  the  speech 
synthesis  task.  Figure  S7  shows  the  utterance  construction 
processing. 
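
The pairwise procedure can be sketched as follows. The report does
not state how transitions are computed or how many frames they span,
so linear interpolation over the last n_trans frames of each phoneme
is an assumption, as are the dictionary layout of a library entry and
all function names. The recalculation of linear prediction and
reflection coefficients from the areas is omitted.

```python
def lerp(start, end, t):
    """Linear interpolation between start and end for 0 <= t <= 1."""
    return start + (end - start) * t


def make_segment(library, name, pitch_factor, gain_factor, n_frames):
    """Duplicate one library entry for n_frames, applying relative factors."""
    entry = library[name]
    return [{"areas": list(entry["areas"]),
             "gain": entry["gain"] * gain_factor,
             "pitch": entry["pitch"] * pitch_factor}
            for _ in range(n_frames)]


def construct(library, spec, n_trans=2):
    """Build frames from spec = [(name, pitch_factor, gain_factor, n_frames), ...].

    Each phoneme must last at least one frame. The last n_trans frames of
    a phoneme are blended toward the first frame of the phoneme that
    follows it (assumed linear transition).
    """
    output = []
    current = make_segment(library, *spec[0])
    for item in spec[1:]:
        following = make_segment(library, *item)
        target = following[0]
        k = min(n_trans, len(current))
        for j in range(k):
            t = (j + 1) / (k + 1)
            frame = current[len(current) - k + j]
            frame["areas"] = [lerp(a, b, t)
                              for a, b in zip(frame["areas"], target["areas"])]
            frame["gain"] = lerp(frame["gain"], target["gain"], t)
            frame["pitch"] = lerp(frame["pitch"], target["pitch"], t)
        output.extend(current)   # earlier phoneme of the pair is written out
        current = following      # later phoneme becomes the first of the next pair
    output.extend(current)       # last phoneme
    return output
```

Only two phonemes are held at a time, which mirrors the read, blend,
and output cycle described above.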

Enter: Phoneme values can be selected from an encoded data file
and inserted into a library.

Delete:  Phoneme  entries  in  a  library  can  be  deleted. 

Dump: The phoneme entries in a library are listed out by this
task (Figure B8).

Average: Many frames of an encoded data file are averaged by
this task and entered into a library as a phoneme. This is useful for
phonemes that can be made as sustained sounds, such as vowels, nasals
and fricatives. A waveform consisting of only one sustained phoneme
is encoded, then averaged, then entered into the library.

Covariance: A library containing at least fifteen different
occurrences of the same phoneme is used by this task to calculate a
covariance matrix for that phoneme and enter it into a covariance
matrix file, along with the mean value.
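
As a minimal sketch of the calculation, the mean vector and covariance
matrix of a set of occurrences can be computed as shown here. The
report does not say which parameters make up a phoneme's feature
vector or whether the divisor is n or n - 1; the n - 1 (unbiased)
form and the function name are assumptions.

```python
def mean_and_covariance(vectors):
    """Return (mean, covariance) for a list of equal-length feature vectors.

    Uses the n - 1 divisor (an assumption), so at least two vectors
    are required.
    """
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two occurrences")
    d = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / (n - 1)
            for j in range(d)]
           for i in range(d)]
    return mean, cov
```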

Invert:  This  task  inverts  the  covariance  matrices  in  a 
covariance  matrix  file  and  generates  a  file  in  the  same  format,  but 
with  the  inverted  matrices  in  place  of  the  covariance  matrices.  The 
matrices  are  stored  in  upper  triangular  column  form  since  they  are 
symmetric. 
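
One common reading of "upper triangular column form" is column-wise
packed storage of the upper triangle, in which element (i, j) with
i <= j is stored at position i + j(j + 1)/2, counting from zero. The
sketch below assumes that layout; the report does not give the actual
file format, and the function names are hypothetical.

```python
def pack_upper(matrix):
    """Pack the upper triangle of a symmetric matrix column by column."""
    d = len(matrix)
    return [matrix[i][j] for j in range(d) for i in range(j + 1)]


def packed_index(i, j):
    """Position of element (i, j) in the packed list (0-based)."""
    if i > j:
        i, j = j, i          # symmetry: (i, j) and (j, i) share one slot
    return i + j * (j + 1) // 2


def get(packed, i, j):
    """Read element (i, j) of the symmetric matrix from packed storage."""
    return packed[packed_index(i, j)]
```

A d-by-d symmetric matrix then needs d(d + 1)/2 stored values instead
of d squared, for example 120 values for a 15-by-15 matrix.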

Dump: This task lists out the entries of a covariance or
inverted matrix file (Figure B9).

Delete:  This  task  deletes  entries  from  a  covariance  or  inverted 
matrix  file. 

Classify: This task uses the inverted covariance matrix file to
nominate phoneme names for each frame of an encoded data file, using a
Mahalanobis weighted nearest mean vector logic [2]. The phoneme names
are inserted into the label fields of the encoded data file. The user can
use these as a guide in making the phoneme selection for entry into
the library.
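
A nearest mean rule weighted by the Mahalanobis distance can be
sketched as below. For a frame vector x and a phoneme with mean m and
inverted covariance matrix C, the squared distance is
(x - m)' C (x - m), and the phoneme with the smallest distance is
nominated. The function names and the models dictionary are
hypothetical, and full (unpacked) matrices are used for clarity.

```python
def mahalanobis_sq(x, mean, inv_cov):
    """Squared Mahalanobis distance of x from mean, given the inverted covariance."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    d = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(d) for j in range(d))


def classify_frame(x, models):
    """Nominate the phoneme whose mean is nearest to x.

    models maps a phoneme name to a (mean, inv_cov) pair.
    """
    return min(models, key=lambda name: mahalanobis_sq(x, *models[name]))
```

Because each phoneme's own inverted covariance weights the distance, a
frame can be assigned to a phoneme whose mean is farther away in the
ordinary Euclidean sense if that phoneme varies widely along the
direction of the difference.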

Any of these tasks may be altered without affecting the file
structure or other tasks, and any new tasks may be added, using the
same files, and/or creating any new files needed. This is the key to
flexibility and extensibility. Figure B10 shows the general
processing flow in this system.